Reputation: 804
I am looking for a method to compare string similarity. Specifically, given two addresses I would like a measure of their similarity.
For example, given "8219 Lime Forest Blvd" and "8219 Lime Forst Boulevard", the output of the comparison should give me an idea of how similar the strings are.
Upvotes: 1
Views: 96
Reputation: 22084
Levenshtein distance is the way to go. Just an out-of-the-box idea: two addresses can differ a lot (one can be a postal code, the other a street with a number), and a lot of money has been spent on building excellent geocoding services (like https://developers.google.com/maps/documentation/geocoding/?hl=cs). So an alternative approach would be to compute the latitude/longitude of both addresses via a geocoding service and check whether the coordinates match :)
Upvotes: 4
Reputation: 6580
You could use something like this:
import org.apache.commons.lang.StringUtils;

public class StringComparison {

    public static void main(String[] args) {
        String s1 = "8219 Lime Forest Blvd";
        String s2 = "8219 Lime Forst Boulevard";
        // Number of single-character edits needed to turn s1 into s2
        int distance = StringUtils.getLevenshteinDistance(s1, s2);
        // "Relative" difference: divide by the longer length, guarding against empty strings
        int longest = Math.max(s1.length(), s2.length());
        float d = longest == 0 ? 0f : (float) distance / longest;
        System.out.println(d);
    }
}
getLevenshteinDistance gives you the number of single-character insertions, deletions and substitutions needed to turn s1 into s2.
I think it's more useful to divide this number by the string length (careful with division by zero) and then manually look for a sweet spot where the difference is small enough to treat the two strings as the same address (for me, this is usually around 20-30%).
This example is in Java; the library used is Apache Commons Lang: http://commons.apache.org/proper/commons-lang/index.html
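If you would rather not pull in the library, the same idea (edit distance, divided by length, compared against a threshold) can be sketched in plain Java. This is only a minimal sketch: the class name AddressSimilarity, the method names and the 0.75 threshold are my own assumptions for illustration, not part of Commons Lang, and the threshold needs tuning on real data.

```java
// Hypothetical helper class for illustration; not part of Apache Commons Lang.
final class AddressSimilarity {

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings: avoid division by zero
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    // The threshold is an assumption; pick it by inspecting your own data.
    static boolean probablySame(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        String s1 = "8219 Lime Forest Blvd";
        String s2 = "8219 Lime Forst Boulevard";
        System.out.println(levenshtein(s1, s2));
        System.out.println(similarity(s1, s2));
        System.out.println(probablySame(s1, s2, 0.75));
    }
}
```

Dividing by the longer of the two lengths keeps the result between 0 and 1 no matter which string is passed first.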
Also, you could improve this by replacing known abbreviations with their full forms and comparing those versions too.
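A rough sketch of that abbreviation step is below. The class name AddressNormalizer and the small abbreviation table are assumptions made up for illustration; a real list would be much longer and locale-specific (and "st" can also mean "saint", so treat the mapping with care).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; the abbreviation table is an assumption.
final class AddressNormalizer {

    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    static {
        ABBREVIATIONS.put("blvd", "boulevard");
        ABBREVIATIONS.put("st", "street"); // ambiguous: could also be "saint"
        ABBREVIATIONS.put("ave", "avenue");
        ABBREVIATIONS.put("rd", "road");
    }

    // Lower-cases the address, collapses whitespace, drops a trailing dot
    // from each token and expands known abbreviations.
    static String normalize(String address) {
        StringBuilder out = new StringBuilder();
        for (String raw : address.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            String token = raw.endsWith(".") ? raw.substring(0, raw.length() - 1) : raw;
            if (token.isEmpty()) {
                continue; // skip empty tokens (e.g. from an empty input)
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(ABBREVIATIONS.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("8219 Lime Forest Blvd"));
        System.out.println(normalize("8219 Lime Forst Boulevard"));
    }
}
```

After normalizing both strings this way, the two example addresses differ only by the missing "e" in "Forst", so the Levenshtein distance between them drops to a single edit.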
Upvotes: 2