jparram

Reputation: 804

Statistical String Comparison

I am looking for a method to compare string similarity. Specifically, given two addresses I would like a measure of their similarity.

For example:

Given "8219 Lime Forest Blvd" and "8219 Lime Forst Boulevard"

The output of the comparison should give me an idea of how similar the strings are.

Upvotes: 1

Views: 96

Answers (2)

Ondrej Svejdar

Reputation: 22084

Levenshtein distance is the way to go. As an out-of-the-box idea: two addresses can differ a lot (one can be a postal code, the other a street with a number), and a lot of money has been spent on building great geocoding services (such as https://developers.google.com/maps/documentation/geocoding/?hl=cs). So an alternative approach would be to calculate the latitude/longitude of both addresses via a geocoding service and check whether the coordinates match :)
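Geocoded coordinates for the same place rarely match exactly, so "matches" in practice probably means "within some tolerance". Here is a minimal sketch of that comparison using the haversine formula. It does not call any geocoding service: the coordinates, the class name GeoMatch and the 50 m tolerance are all made-up assumptions for illustration.

```java
// Sketch: decide whether two already-geocoded addresses refer to the same place.
// GeoMatch, the sample coordinates and the tolerance are hypothetical.
class GeoMatch {
    static final double EARTH_RADIUS_M = 6_371_000.0; // mean Earth radius in meters

    /** Great-circle distance in meters between two lat/lon points (haversine formula). */
    static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    /** True when the two points are no further apart than the given tolerance. */
    static boolean samePlace(double lat1, double lon1, double lat2, double lon2,
                             double toleranceMeters) {
        return distanceMeters(lat1, lon1, lat2, lon2) <= toleranceMeters;
    }

    public static void main(String[] args) {
        // Pretend these came back from a geocoder for the two address strings.
        double latA = 29.5000, lonA = -98.6000;
        double latB = 29.5001, lonB = -98.6001;
        System.out.println(samePlace(latA, lonA, latB, lonB, 50.0));
    }
}
```

The right tolerance depends on how precise the geocoder is, so it would need tuning against real data.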

Upvotes: 4

Leo

Reputation: 6580

You could use something like this:

import org.apache.commons.lang.StringUtils;


public class StringComparison {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String s1 = "8219 Lime Forest Blvd";
        String s2 = "8219 Lime Forst Boulevard";

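        // number of single-character edits (insert, delete, substitute) between s1 and s2
        int distance = StringUtils.getLevenshteinDistance(s1, s2);

        // "relative" difference: distance divided by the longer length (0 if both are empty)
        int longest = Math.max(s1.length(), s2.length());
        float d = longest == 0 ? 0f : (float) distance / longest;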

        System.out.println(d);

    }

}

getLevenshteinDistance returns the number of single-character edits (insertions, deletions or substitutions) needed to turn s1 into s2.

I think it's more useful to divide this number by the string length (be careful about division by zero) and then manually look for a sweet spot where the difference is small enough to treat the two strings as the same address (for me, this is usually around 20-30%).

This example is in Java; the library used is Apache Commons Lang: http://commons.apache.org/proper/commons-lang/index.html
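If you want to try the relative-difference threshold without pulling in a library, here is a minimal dependency-free sketch. The class and method names (AddressSimilarity, relativeDifference, probablySame) are my own, and dividing by the longer of the two lengths is one possible normalization, chosen here so the result stays between 0 and 1.

```java
// Sketch: Levenshtein distance, normalized to [0, 1], compared against a threshold.
// All names here are hypothetical; the threshold is something you tune by hand.
class AddressSimilarity {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Distance divided by the longer length; 0.0 when both strings are empty. */
    static double relativeDifference(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // avoids division by zero
        }
        return (double) levenshtein(a, b) / longer;
    }

    /** True when the relative difference does not exceed the chosen threshold. */
    static boolean probablySame(String a, String b, double threshold) {
        return relativeDifference(a, b) <= threshold;
    }

    public static void main(String[] args) {
        String s1 = "8219 Lime Forest Blvd";
        String s2 = "8219 Lime Forst Boulevard";
        System.out.println(relativeDifference(s1, s2));
        System.out.println(probablySame(s1, s2, 0.30));
    }
}
```

For the two addresses from the question the distance is 6 and the longer string has 25 characters, so the relative difference is 0.24, which falls under a 30% threshold.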

You could also improve this by replacing known abbreviations with their expanded forms and comparing those variants too.
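A rough sketch of that abbreviation step could look like the following. The AddressNormalizer class and its tiny lookup table are assumptions for illustration only; a real table would be much longer and would have to deal with ambiguous cases (for instance "st" can also mean "saint").

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch: expand a few known abbreviations before comparing addresses.
// The table below is a small, incomplete, hypothetical example.
class AddressNormalizer {
    static final Map<String, String> ABBREVIATIONS = new LinkedHashMap<>();
    static {
        ABBREVIATIONS.put("blvd", "boulevard");
        ABBREVIATIONS.put("st", "street");
        ABBREVIATIONS.put("ave", "avenue");
        ABBREVIATIONS.put("rd", "road");
    }

    /** Lower-cases the address, strips periods and commas, and expands known abbreviations. */
    static String normalize(String address) {
        StringBuilder out = new StringBuilder();
        for (String token : address.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            String word = token.replaceAll("[.,]", ""); // "Blvd." becomes "blvd"
            if (word.isEmpty()) {
                continue; // skip tokens that were only punctuation or empty input
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(ABBREVIATIONS.getOrDefault(word, word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("8219 Lime Forest Blvd"));
        System.out.println(normalize("8219 Lime Forst Boulevard"));
    }
}
```

After normalization the two addresses from the question differ only by the missing "e" in "forst", so their Levenshtein distance drops from 6 to 1.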

Upvotes: 2
