Reputation: 1670
I came across a forum post that describes a method of creating a Python UDF in Redshift: https://community.periscopedata.com/r/y715m2.
More info about Python UDFs in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html
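The UDF itself is essentially a plain-Python Levenshtein implementation wrapped in CREATE FUNCTION. The sketch below is my own minimal version of it, not necessarily the exact code from the linked post:

-- Minimal Levenshtein UDF; the body is ordinary Python (plpythonu),
-- no external libraries needed.
CREATE OR REPLACE FUNCTION public.levenshtein(s VARCHAR, t VARCHAR)
RETURNS INTEGER
IMMUTABLE
AS $$
    if s is None or t is None:
        return None
    if len(s) < len(t):
        s, t = t, s
    # Classic dynamic-programming recurrence, keeping one row at a time.
    previous = range(len(t) + 1)
    for i, cs in enumerate(s):
        current = [i + 1]
        for j, ct in enumerate(t):
            insert_cost  = previous[j + 1] + 1
            delete_cost  = current[j] + 1
            replace_cost = previous[j] + (cs != ct)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]
$$ LANGUAGE plpythonu;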
I checked a number of outputs from the function (e.g. select public.levenshtein('walk', 'cake')) and it works quite well.
I am hoping to use this concept for fuzzy matching in joins between two tables on t1.first_name+last_name = t2.first_name+last_name.
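Roughly the kind of query I have in mind is sketched below; the table names and the threshold of 2 are just placeholders, and that threshold is exactly what I am unsure about:

-- Fuzzy join on concatenated names; note that a predicate like this is
-- effectively a cross join plus a filter, so it gets expensive on large tables.
SELECT t1.first_name AS fn1, t1.last_name AS ln1,
       t2.first_name AS fn2, t2.last_name AS ln2,
       public.levenshtein(t1.first_name || t1.last_name,
                          t2.first_name || t2.last_name) AS name_distance
FROM customers_a t1
JOIN customers_b t2
  ON public.levenshtein(t1.first_name || t1.last_name,
                        t2.first_name || t2.last_name) <= 2;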
Is anyone familiar with a "magical range" (or can suggest one from experience) within which a record should fall to be deemed a likely match? I.e., what should the minimum and maximum levenshtein(s, t) values be for a pair to be considered a likely match?
Upvotes: 0
Views: 1260
Reputation: 4208
It depends on your particular case. Think of it as a simple machine learning problem for which you provide a training dataset: run the function against your own data, look at the values it produces for different kinds of pairs, and set your range based on that. If you're matching names, the cost of an error is quite high for both false negatives (no match for the same person) and false positives (a match for different people), so I would go with Soundex rather than Levenshtein. AFAIK the Levenshtein distance is equal to one for last names that differ in only one letter, but that covers two very different cases: last names that are actually the same but spelled differently, and last names that are genuinely different but happen to differ by a single letter. Soundex is better at distinguishing these cases.
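If you stay with Python UDFs, a minimal sketch of the Soundex route could look like the following. The function name public.soundex and the column names are made up, this is a simplified Soundex that treats H and W like vowels, and depending on your Redshift version there may also be a built-in SOUNDEX function you could use instead:

-- Simplified Soundex UDF: keep the first letter, encode the rest as digits,
-- drop adjacent repeats, pad/truncate to 4 characters.
CREATE OR REPLACE FUNCTION public.soundex(s VARCHAR)
RETURNS VARCHAR(4)
IMMUTABLE
AS $$
    if not s:
        return None
    s = ''.join(ch for ch in s.upper() if ch.isalpha())
    if not s:
        return None
    codes = {'B': '1', 'F': '1', 'P': '1', 'V': '1',
             'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2',
             'S': '2', 'X': '2', 'Z': '2',
             'D': '3', 'T': '3',
             'L': '4',
             'M': '5', 'N': '5',
             'R': '6'}
    digits = []
    prev = codes.get(s[0], '')
    for ch in s[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            digits.append(code)
        prev = code
    return (s[0] + ''.join(digits) + '000')[:4]
$$ LANGUAGE plpythonu;

-- Matching then becomes an exact equality on the encoded values, e.g.:
-- ON public.soundex(t1.last_name) = public.soundex(t2.last_name)
--    AND public.soundex(t1.first_name) = public.soundex(t2.first_name)

An exact-equality join on the encoded values also avoids the nested-loop cost you get when the join condition itself calls a distance UDF.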
Upvotes: 1