Binoy Babu

Reputation: 17139

Comparing/matching Strings in Java

I'm dealing with a large database table that has two columns. The first column, id, is a long, and the second column, name, is a String. name is the name of the person with the corresponding id. I want to compare the name in each row with the names in the other rows.

John Carter
john Carter
Carter
jo car
Willam Carter
C William
Carter j.

All of these names should match one another. If possible, it would be great to get a percentage/ratio for each match. Is there any Java library/snippet that can do this? I'm open to all suggestions.

Upvotes: 2

Views: 2379

Answers (3)

aiolos

Reputation: 4697

Have a look at the paper 'A Comparison of String Distance Metrics for Name-Matching Tasks' by William W. Cohen et al. The paper compares several string distance metrics.

They also implemented most of them in the SecondString project. It is an "open-source Java-based package of approximate string-matching techniques", so you can easily compare the different metrics to evaluate which of them fits your requirements.

If you just need to match names, Jaro-Winkler is a good choice; it is also implemented in the SecondString package.
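To give a feel for what Jaro-Winkler computes, here is a minimal self-contained Java sketch. It is not the SecondString code; the class name `JaroWinklerSketch` is made up for illustration, and the prefix scale of 0.1 and maximum prefix length of 4 are assumed here because they are the commonly used defaults.

```java
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]; 1.0 means the strings are identical. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    /** Jaro similarity boosted for a shared prefix of up to 4 characters (assumed scale 0.1). */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

For example, `JaroWinklerSketch.jaroWinkler("MARTHA", "MARHTA")` comes out at roughly 0.961. The comparison is case-sensitive and works on whole strings, so you would probably want to lowercase the names first, and inputs such as "Carter" versus "John Carter" may need to be compared token by token rather than as whole strings.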

If you have all your names in a database, it may make sense to implement the similarity measure as a stored procedure to avoid fetching all the data and comparing it in Java. You could then use queries like this:

SELECT t1.name, t2.name, sim(t1.name, t2.name) FROM table t1, table t2 WHERE sim(t1.name, t2.name) > 0.8

Upvotes: 0

Apfelsaft

Reputation: 5846

This library could be interesting for you: http://sourceforge.net/projects/simmetrics/

It provides different similarity measures for Strings.

From their SourceForge page:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).

Upvotes: 4

Dunes

Reputation: 40903

Looks like you'll be interested in the Levenshtein algorithm for computing string distances. You can find a Java implementation here.
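In case it helps, here is a minimal sketch of the Levenshtein distance in Java, together with one possible way to turn it into a match ratio. The class name `LevenshteinSketch` and the `similarity` normalization (1 minus distance divided by the longer length, after trimming and lowercasing) are assumptions made for illustration, not part of any particular library.

```java
import java.util.Locale;

final class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int distance(String a, String b) {
        // Two rolling rows of the dynamic-programming table are enough.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for equal strings (ignoring case), towards 0.0 as they diverge. */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) return 1.0;
        return 1.0 - (double) distance(x, y) / longest;
    }
}
```

With this sketch, `similarity("John Carter", "john Carter")` is 1.0 and `similarity("Willam Carter", "William Carter")` is about 0.93. Whole-string edit distance will score pairs like "Carter" and "John Carter" fairly low, so for your data you may want to split names into tokens and compare those instead.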

Upvotes: 4
