Efficient Name Disambiguation for Large-Scale Databases

Sun Dec 21, 2008

I've been interested for a while now in entity disambiguation, particularly where the entities are the names of authors in academic journals.

Jian Huang, Seyda Ertekin and C. Lee Giles have a paper from 2006, "Efficient Name Disambiguation for Large-Scale Databases", in which they describe the method that they used to disambiguate the CiteSeer data set of over 700,000 articles. They were able to use this algorithm to disambiguate the data set over three days into just under half a million unique authors (though I didn't see a mention of the hardware, nor of whether they used a serial or parallel computing approach).

Their approach seems to be to use an online SVM to bootstrap a distance function. This distance function can be trained on several types of information: names, metadata such as emails, and terms extracted from the associated papers.
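To make the idea concrete for myself, here is a minimal sketch in Python of learning a pairwise distance from labelled pairs of author records. Everything in it is my own assumption rather than anything taken from the paper: the three features in pair_features, the record layout, the plain hinge-loss SGD update standing in for whatever online SVM solver the authors used, and the logistic squashing that turns the SVM score into a distance.

```python
import math

def pair_features(a, b):
    """Hypothetical similarity features for a pair of author records."""
    same_name = 1.0 if a["name"].lower() == b["name"].lower() else 0.0
    same_email = 1.0 if a["email"] == b["email"] else 0.0
    ta, tb = set(a["terms"]), set(b["terms"])
    union = ta | tb
    term_overlap = len(ta & tb) / len(union) if union else 0.0
    return [same_name, same_email, term_overlap]

class OnlineLinearSVM:
    """Linear SVM trained one example at a time with hinge-loss SGD.

    A stand-in for a real online SVM solver, not the authors' method.
    """
    def __init__(self, n_features, eta=0.1, lam=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta  # learning rate
        self.lam = lam  # L2 regularisation strength

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Update on one example; y is +1 (same person) or -1 (different)."""
        margin = y * self.score(x)
        # Shrink the weights (regularisation), then step if the margin is violated.
        self.w = [(1.0 - self.eta * self.lam) * wi for wi in self.w]
        if margin < 1.0:
            self.w = [wi + self.eta * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.eta * y

def distance(model, a, b):
    """Assumed mapping: a high SVM score (same person) gives a small distance."""
    return 1.0 / (1.0 + math.exp(model.score(pair_features(a, b))))

# Toy records, invented for illustration.
r1 = {"name": "J. Smith", "email": "js@a.edu", "terms": ["svm", "kernel"]}
r2 = {"name": "J. Smith", "email": "js@a.edu", "terms": ["svm", "clustering"]}
r3 = {"name": "J. Smith", "email": "jsmith@b.org", "terms": ["galaxy", "redshift"]}
labelled_pairs = [(r1, r2, +1), (r1, r3, -1), (r2, r3, -1)]

model = OnlineLinearSVM(n_features=3)
for _ in range(50):  # several passes over the labelled pairs
    for a, b, y in labelled_pairs:
        model.partial_fit(pair_features(a, b), y)

d_same = distance(model, r1, r2)  # two records of the same person
d_diff = distance(model, r1, r3)  # same name, different person
print(d_same, d_diff)
```

Because the model is updated one pair at a time, more labelled pairs can be fed to partial_fit later without retraining from scratch, which is the property that makes an online learner attractive here.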

They then block author names into groups based on name similarity and, within each group, use the distance function found with the SVM to find sets of names associated with the same person by scanning over the data with DBSCAN, a clustering algorithm that creates clusters based on a neighborhood radius and a minimum number of members. By slicing up the parameter space based on a distance threshold, rather than on an a priori number of clusters, the algorithm is insensitive to a change in the number of points in the parameter space. This means you can use this method in an iterative way and it can be adapted to new data as it arrives. I'm reminded of some papers in astrophysics that did clustering based on Voronoi volumes, but only in so far as the Voronoi method is vaguely related to a density method.
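The blocking and clustering steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the blocking key (last name plus first initial), the toy records, the values of eps and min_pts, and the Jaccard distance on paper terms, which stands in for the learned distance. The dbscan function is a plain textbook version, not the authors' implementation.

```python
from collections import defaultdict

def block_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])

def jaccard_distance(a, b):
    """Stand-in distance on sets of paper terms (not the learned distance)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def dbscan(points, dist, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual definition.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nearby)
    return labels

# Toy records: (author name as printed, terms from the paper). Invented data.
records = [
    ("J. Smith", {"svm", "kernel", "margin"}),
    ("John Smith", {"svm", "kernel", "clustering"}),
    ("J Smith", {"galaxy", "redshift", "survey"}),
    ("Jane Smith", {"galaxy", "redshift", "halo"}),
    ("A. Jones", {"svm", "kernel", "margin"}),
]

# Step 1: blocking, so that only records with similar names are ever compared.
blocks = defaultdict(list)
for name, terms in records:
    blocks[block_key(name)].append(terms)

# Step 2: run DBSCAN inside each block.
clusters = {
    key: dbscan(term_sets, jaccard_distance, eps=0.6, min_pts=2)
    for key, term_sets in blocks.items()
}
print(clusters)
```

Note that nowhere is the number of authors specified: the clusters fall out of eps and min_pts alone. In this sketch a record labelled -1 has no close neighbor in its block, and I would treat it as an author in its own right.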

All in all this looks like a nice approach to the problem, and the authors got 90% accuracy with their method, which is probably enough to bootstrap a solution into existence.


This work is licensed under a Creative Commons Attribution 4.0 International License