Efficient Name Disambiguation for Large-Scale Databases
Sun Dec 21, 2008
I've been interested for a while now in entity disambiguation,
particularly where the entities are the names of authors in academic
journals.
Jian Huang, Seyda Ertekin and C. Lee Giles have a paper from 2006 in
which they describe the method that they used to disambiguate the
CiteSeer data set of over 700,000 articles. The paper is "Efficient
Name Disambiguation for Large-Scale Databases".
They were able to use this algorithm to disambiguate this data set
into just under half a million unique authors over three days (though
I didn't see a mention of the hardware, nor of whether they used a
serial or parallel computing approach).
Their approach seems to be to use an online SVM to bootstrap a
distance function. This distance function can be trained using a
number of types of information: names, metadata such as emails, and
terms extracted from the associated papers.
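To make that step concrete for myself, here is a minimal sketch of how an online linear SVM could turn pairs of author records into a distance. Everything in it is my own assumption rather than something taken from the paper: the record fields (`name`, `email`, `terms`), the features in `pair_features`, the toy records, and the Pegasos-style hinge-loss update, which is a simple stand-in and not necessarily the online SVM solver the authors used.

```python
import math

def pair_features(a, b):
    """Similarity features for two author records (hypothetical schema)."""
    name_a = a["name"].lower().replace(".", "").split()
    name_b = b["name"].lower().replace(".", "").split()
    same_last = 1.0 if name_a[-1] == name_b[-1] else 0.0
    same_initial = 1.0 if name_a[0][0] == name_b[0][0] else 0.0
    same_email = 1.0 if a["email"] and a["email"] == b["email"] else 0.0
    terms_a, terms_b = set(a["terms"]), set(b["terms"])
    union = terms_a | terms_b
    jaccard = len(terms_a & terms_b) / len(union) if union else 0.0
    return [same_last, same_initial, same_email, jaccard, 1.0]  # last entry is a bias term

class OnlineLinearSVM:
    """Linear SVM trained one example at a time (Pegasos-style subgradient step)."""

    def __init__(self, n_features, lam=0.1):
        self.w = [0.0] * n_features
        self.lam = lam  # regularisation strength
        self.t = 0      # number of examples seen so far

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def partial_fit(self, x, y):
        """y is +1 (same person) or -1 (different people)."""
        self.t += 1
        eta = 1.0 / (self.lam * self.t)   # decaying step size
        margin = y * self.score(x)        # margin under the current weights
        self.w = [wi * (1.0 - eta * self.lam) for wi in self.w]
        if margin < 1.0:                  # hinge loss is active: move towards y * x
            self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]

def distance(model, a, b):
    """Squash the SVM score into (0, 1): a high score means a small distance."""
    return 1.0 / (1.0 + math.exp(model.score(pair_features(a, b))))

# Toy, made-up records purely for illustration.
r1 = {"name": "A. Smith", "email": "asmith@example.edu",
      "terms": ["svm", "kernel", "clustering"]}
r2 = {"name": "Alice Smith", "email": "asmith@example.edu",
      "terms": ["svm", "kernel"]}
r3 = {"name": "Adam Smith", "email": "", "terms": ["galaxy", "redshift"]}

model = OnlineLinearSVM(n_features=5)
for _ in range(100):
    model.partial_fit(pair_features(r1, r2), +1)  # labelled "same person"
    model.partial_fit(pair_features(r1, r3), -1)  # labelled "different people"

print(distance(model, r1, r2), distance(model, r1, r3))
```

The point of the online update is that labelled pairs can be fed in one at a time, so the distance can keep improving as more labelled data turns up without retraining from scratch.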
They then block author names into groups based on name similarity, use
the distance function found with the SVM, and find groups of names
associated with the same person by scanning over the data using
DBSCAN, which is a clustering algorithm that creates clusters based on
a distance threshold and a minimum number of members. By slicing up
the parameter space based on a distance threshold, rather than on an a
priori number of clusters, the algorithm is insensitive to a change in
the number of points in the parameter space. This means you can use
this method in an iterative way and it can be adapted to new data as
it arrives. I'm reminded of some papers in astrophysics that did
clustering based on Voronoi volumes, but only in so far as the Voronoi
method is vaguely related to a density method.
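Here is a rough sketch of the blocking and clustering steps as I understand them. The blocking rule (first initial plus last name), the `eps` and `min_pts` values, the toy records and the Jaccard distance on paper terms are all my own assumptions; in the paper's pipeline the distance would be the one learned by the SVM, and I have no idea what blocking rule or parameters they actually used. The DBSCAN itself is the textbook algorithm written out in plain Python.

```python
from collections import defaultdict

def block_key(name):
    """Blocking key: first initial plus last name (an assumed, simple rule)."""
    parts = name.lower().replace(".", "").split()
    return parts[0][0] + " " + parts[-1]

def jaccard_distance(a, b):
    """Stand-in distance on term sets; a learned distance would go here."""
    union = a["terms"] | b["terms"]
    return 1.0 - len(a["terms"] & b["terms"]) / len(union) if union else 0.0

def dbscan(points, distance, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point; -1 means noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # Every point within eps of point i, including i itself.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # not dense enough (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point previously marked as noise
            if labels[j] is not None:
                continue              # already assigned, do not expand again
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts: # j is a core point, so keep growing the cluster
                queue.extend(reach)
    return labels

def disambiguate(records, distance, eps=0.6, min_pts=2):
    """Block records by name, then cluster each block independently."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)
    return {key: dbscan(recs, distance, eps, min_pts)
            for key, recs in blocks.items()}

# Toy, made-up records purely for illustration.
records = [
    {"name": "A. Smith", "terms": {"svm", "kernel", "clustering"}},
    {"name": "Alice Smith", "terms": {"svm", "kernel", "ranking"}},
    {"name": "A Smith", "terms": {"galaxy", "redshift", "survey"}},
    {"name": "Adam Smith", "terms": {"galaxy", "redshift", "halo"}},
    {"name": "B. Jones", "terms": {"svm", "kernel"}},
]
result = disambiguate(records, jaccard_distance)
print(result)
```

Note that nowhere do I tell it how many authors to expect in a block; the clusters fall out of `eps` and `min_pts`. Records labelled -1 are ones with no close neighbour in their block, which I would read as authors with a single record. Blocking also keeps the pairwise distance computations confined to small groups, which I assume is a big part of how this scales.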
All in all this looks like a nice approach to the problem, and the
authors got 90% accuracy with their method, which is probably enough
to bootstrap a solution into existence.