Reputation: 17139
I'm dealing with a large database table that has two columns. The first column, id, is a long, while the second column, name, is a String. name is the name of the person with the corresponding id. I want to compare the name in each row with the name in the other rows.
John Carter
john Carter
Carter
jo car
Willam Carter
C William
Carter j.
All of these names should match each other. If possible, it would be great to get a percentage/ratio for each match. Is there any Java library or snippet that can do this? I'm open to all suggestions.
Upvotes: 2
Views: 2379
Reputation: 4697
Have a look at the paper 'A Comparison of String Distance Metrics for Name-Matching Tasks' by William W. Cohen et al. The paper compares several string distance metrics.
They also implemented most of them in the SecondString project. It is an "open-source Java-based package of approximate string-matching techniques", so you can easily compare the different metrics and evaluate which of them fits your requirements.
If you just need to match names - Jaro-Winkler is a good choice, which is also implemented within the SecondString package.
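To give a feel for what Jaro-Winkler computes, here is a minimal, self-contained sketch in plain Java. It is not the SecondString implementation; the class name JaroWinklerSketch and its method names are made up for illustration, and the prefix scaling factor of 0.1 with a maximum prefix of 4 characters is the commonly used default, assumed here.

```java
// Illustrative sketch only; not taken from SecondString or any other library.
final class JaroWinklerSketch {

    // Jaro similarity in [0, 1]; 1.0 means identical strings.
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters count as matching only if they are within this distance.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;

        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j); // 0.1 is the assumed scaling factor
    }
}
```

The comparison is case-sensitive as written, so for the names in the question you would probably want to lowercase and trim both strings before calling JaroWinklerSketch.jaroWinkler(a, b), and then treat the returned value as the match ratio.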
If you have all your names in a database, it may make sense to implement the similarity measure as a stored procedure, so that you don't have to fetch all the data and compare it in Java. You could then use queries like this:
SELECT t1.name, t2.name, sim(t1.name, t2.name) FROM table t1, table t2 WHERE sim(t1.name, t2.name) > 0.8
Upvotes: 0
Reputation: 5846
This library could be interesting for you: http://sourceforge.net/projects/simmetrics/
It provides different similarity measures for Strings.
From their SourceForge page:
SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).
Upvotes: 4
Reputation: 40903
Looks like you'll be interested in the Levenshtein algorithm for computing string distances. You can find a Java implementation here.
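In case the link is unavailable, a minimal sketch of the Levenshtein distance in plain Java could look like the following. The class name LevenshteinSketch and the similarity helper are illustrative assumptions, not part of any library; the helper lowercases and trims its inputs and turns the distance into a ratio between 0 and 1, which is one simple way to get the percentage asked for in the question.

```java
import java.util.Locale;

// Illustrative sketch only; names are hypothetical.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insertion
                                 prev[j] + 1),      // deletion
                        prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive ratio in [0, 1]; 1.0 means the names are identical
    // after trimming and lowercasing (an assumed normalization).
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        if (max == 0) return 1.0;
        return 1.0 - (double) distance(x, y) / max;
    }
}
```

Note that plain edit distance compares the strings character by character, so reordered names such as "Carter j." versus "John Carter" will score low unless you also split the names into tokens and compare those.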
Upvotes: 4