user9758331


How to choose a fuzzy matching algorithm?

I need to know the criteria that differentiate these three fuzzy matching algorithms:

Levenshtein distance Algorithm

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
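The definition above translates directly into a dynamic-programming recurrence. As a minimal, unoptimized sketch (production code should use a fast C-backed library instead):

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic programming: prev[j] holds the edit distance
    # between the prefix of `a` processed so far and b[:j].
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.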

Damerau–Levenshtein distance

Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
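The only change from plain Levenshtein is the extra transposition case. A sketch of the restricted variant (optimal string alignment, which is what many libraries actually implement under this name):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    # Optimal-string-alignment variant of Damerau–Levenshtein:
    # Levenshtein plus adjacent-character transposition as one edit.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

This is why it suits typo detection: `"ca"` vs `"ac"` costs 1 here but 2 under plain Levenshtein.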

Bitap algorithm with modifications by Wu and Manber

The Bitap algorithm is an approximate string matching algorithm. It tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.
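The Wu–Manber extension keeps one bit-parallel state register per error level. A minimal sketch of that recurrence (Python's arbitrary-precision integers mean the pattern need not fit in a machine word, at some speed cost):

```python
def bitap_fuzzy(text: str, pattern: str, k: int) -> int:
    """Return the end index in `text` of the first substring within
    Levenshtein distance k of `pattern`, or -1 if there is none."""
    m = len(pattern)
    if m == 0:
        return 0
    # One bit mask per pattern character: bit i set <=> pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # R[d]: bit j set <=> pattern[:j+1] matches a suffix of the text
    # read so far with at most d errors.
    R = [0] * (k + 1)
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        old = R[:]
        R[0] = ((old[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            R[d] = ((((old[d] << 1) | 1) & mask)   # exact character match
                    | ((old[d - 1] << 1) | 1)      # substitution
                    | old[d - 1]                   # extra character in text
                    | ((R[d - 1] << 1) | 1))       # missing character in text
        if R[k] & (1 << (m - 1)):  # full pattern matched within k errors
            return pos
    return -1
```

For instance, `bitap_fuzzy("hello world", "wrld", 1)` finds a match ending at index 10, since "world" is one deletion away from "wrld".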

My document is a table of company names; some companies appear two or three times because of misspellings. In this particular case, how do I group the companies by matching them? Which algorithm should I choose, and why? The file has 100k lines and it is growing.

Upvotes: 3

Views: 4255

Answers (2)

user9937931


If you are on Google Sheets, try using Flookup. You say that your list has over 100K rows, so that might prove a bit of a challenge depending on Google's (timed) execution limits, but I still encourage you to give it a try. One function you might be interested in is:

FLOOKUP(lookupValue, tableArray, lookupCol, indexNum, [threshold], [rank])

Full disclosure: I created Flookup for Google Sheets.

Upvotes: 2

James Smith

Reputation: 1

If you want to group the companies you could use locality sensitive hashing or a clustering method such as K-medoids clustering with e.g., Levenshtein edit distance as the metric. Alternatively, you can use SymSpell.
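To make the grouping idea concrete, here is a greedy clustering sketch. It uses the standard library's `difflib` ratio as a stand-in similarity metric (swap in Levenshtein or Damerau–Levenshtein as you prefer), and the 0.85 threshold is an arbitrary assumption to tune on your data:

```python
from difflib import SequenceMatcher

def group_companies(names, threshold=0.85):
    # Greedy clustering: each name joins the first existing group whose
    # representative is similar enough, otherwise it starts a new group.
    # This is O(n * number_of_groups) — fine as a demo, but for 100k+
    # rows use an indexed approach (SymSpell, LSH) as described above.
    groups = []  # list of (normalized representative, members)
    for name in names:
        key = name.lower().strip()  # crude normalization before matching
        for rep, members in groups:
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]
```

For example, `group_companies(["Acme Corp", "Acme Corp.", "ACME Corp", "Globex Inc"])` yields two groups: the three Acme spellings together, and Globex alone.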

Levenshtein and Damerau–Levenshtein distance are both good metrics for string similarity, but make sure that you use a fast implementation; there are too many popular and insanely slow ones on GitHub. The best I know of are polyleven and editdistance.

Upvotes: 0
