Rod0n

Reputation: 1079

location extraction with fuzzy matching capabilities

I have a bids database with a "bid location" field that contains input from manual workers.

I'm using a list of buenos aires streets as the corpus:

av. de mayo
av. del libertador
av. diaz velez

And some of the bid locations fields contain the following text:

of. de compras hosp. c. durand (diaz velez 5044) c.a.b.a
av. de mayo 525, planta baja, oficina 11, ciudad de buenos aires
oficina de compras - av. diaz velez 5044 - cap. fed. -

I'm reading the Python Text Processing with NLTK book because it has a "Location Extraction" section, which I implemented. The problem with this code is that the match must be perfect between some sentence from the corpus and some slice of the input text.

As you can see in the examples I gave above, "diaz velez" won't be a match because the corpus contains "av. diaz velez".

I thought about using fuzzy matching, but I don't think it's a good solution because the program would have to compare a lot of strings to get a result.

I've been researching and haven't found an example or a similar solution to this problem, so I'm asking if you have some pointers for me to follow, because I'm quite lost.

Upvotes: 0

Views: 1121

Answers (2)

eldams

Reputation: 750

To overcome this "exact match" limitation, many, many solutions are possible. As a first step, you can design a backoff strategy: as long as you find exact matches in the text, everything's fine. For the remaining parts of the text, you can try approximate string matching.

Personally, I'd suggest a simple approach to start with. For any string that remains unannotated, try to gradually gather clues to see if parts of those strings (in fact, token lists) can be related to known entries.

For instance:

  1. Once exact matches have been found, break down the unannotated strings into lists of n-grams.
  2. Using bags of words (with stop words removed), compute the cosine distance of each n-gram to your known list of locations: a threshold will let you select possible candidates.
  3. For this limited set of candidates, compute the edit distance to the text to rank them.
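A minimal sketch of those three steps in Python (the stop list, threshold, and helper names are illustrative, not something from the question):

```python
import difflib
import math
from collections import Counter

LOCATIONS = ["av. de mayo", "av. del libertador", "av. diaz velez"]
STOP_WORDS = {"de", "del"}  # illustrative stop list

def tokens(text):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    return [w for w in (t.strip(".,()-") for t in text.lower().split()) if w]

def ngrams(toks, n):
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def bag(text):
    # Bag of words with stop words removed (step 2).
    return Counter(t for t in tokens(text) if t not in STOP_WORDS)

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def candidates(text, threshold=0.5):
    # Step 1: break the unannotated string into n-grams;
    # step 2: keep (ngram, location) pairs above a cosine threshold.
    toks = tokens(text)
    found = set()
    for n in (1, 2, 3):
        for gram in ngrams(toks, n):
            for loc in LOCATIONS:
                if cosine(bag(gram), bag(loc)) >= threshold:
                    found.add((gram, loc))
    return found

def best_match(text):
    # Step 3: rank the shortlist by a string-similarity ratio
    # (difflib stands in for an edit distance here).
    cands = candidates(text)
    if not cands:
        return None
    return max(cands,
               key=lambda c: difflib.SequenceMatcher(None, c[0], c[1]).ratio())

print(best_match("oficina de compras - av. diaz velez 5044 - cap. fed. -"))
```

Because the expensive edit-distance ranking only runs on the small shortlist from step 2, this stays cheap even with a large location list.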

I believe that you're moving from deterministic to non-deterministic matching, which indeed has advantages and drawbacks. Statistics will definitely help. You may also consider transforming your list of locations into a grammar, which would make some tokens optional (e.g. "av.").
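One lightweight way to approximate that grammar idea is to compile each known location into a regular expression in which designated tokens are optional (the OPTIONAL set below is illustrative):

```python
import re

LOCATIONS = ["av. de mayo", "av. del libertador", "av. diaz velez"]
OPTIONAL = {"av.", "avda.", "calle"}  # illustrative optional tokens

def to_pattern(location):
    # Optional tokens may be skipped entirely; mandatory tokens are
    # joined by flexible whitespace.
    pieces = []
    for tok in location.split():
        esc = re.escape(tok)
        if tok in OPTIONAL:
            pieces.append(rf"(?:{esc}\s+)?")
        else:
            pieces.append(esc + r"\s+")
    pattern = "".join(pieces)
    if pattern.endswith(r"\s+"):  # drop the trailing whitespace requirement
        pattern = pattern[:-len(r"\s+")]
    return re.compile(pattern, re.IGNORECASE)

for loc in LOCATIONS:
    pat = to_pattern(loc)
    for text in [
        "of. de compras hosp. c. durand (diaz velez 5044) c.a.b.a",
        "av. de mayo 525, planta baja, oficina 11, ciudad de buenos aires",
    ]:
        m = pat.search(text)
        if m:
            print(f"{loc!r} found as {m.group(0)!r}")
```

With this, "diaz velez" in the first text matches the corpus entry "av. diaz velez" even though the "av." prefix is missing.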

Upvotes: 1

stachyra

Reputation: 4603

This problem might be a really good place for you to attempt a novel use of some classic biological sequence alignment algorithms such as Smith-Waterman or Needleman-Wunsch. Although these algorithms do involve many different string comparisons, which you said you hoped to avoid, they still manage to achieve good speed and efficiency by employing specialized dynamic programming techniques to take advantage of certain internal redundancies that are often found in these types of "fuzzy matching" problems.

One difference between your problem, vs. the typical biological fuzzy string matching problems that Smith-Waterman and Needleman-Wunsch were originally designed to solve, lies in the number of symbols which comprise the allowed "alphabet". In biological sequence matching problems, the "alphabet" typically consists either of four DNA or RNA nucleotides (adenine, cytosine, guanine, and thymine (DNA) or uracil (RNA), traditionally notated as A, C, G, T or U) or else, depending upon the problem domain, between 20 and 22 amino acids. For your problem, OTOH, you'll probably want to use the full 26 letters of the Roman alphabet (52 if you decide to make a distinction between upper and lower case) plus some other symbols such as periods, commas, apostrophes, parentheses, digits 0-9, etc.

Another thing to be aware of: biological sequence alignment algorithms typically have a semi-heuristic scoring system: e.g., you assign a certain number of positive points for each matching symbol, a certain number of negative points for the existence of a "gap" in the alignment (which could come into play in your case, if the bid location list happens to use a period or comma in a place name, while the corpus omits it), and then perhaps you assign a somewhat smaller number of extra negative points if the gap is 2 or 3 characters long instead of only 1 character, etc. The way that you assign these points is somewhat arbitrary, and you may need to tune the operation of the algorithm a bit in order to get an overall outcome which in the majority of cases usually looks sensible to you.

The good news is that once you do all of that, the algorithm will return a "match score" between each line in the bid location list and each line in the location corpus, and you can simply select the highest scoring match.
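As a rough illustration, a character-level Smith-Waterman scorer fits in a few lines of Python; the match/mismatch/gap values below are illustrative and would need the tuning described above:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Local alignment score via dynamic programming: H[i][j] holds the best
    # score of an alignment ending at a[:i] / b[:j]; the 0 floor lets
    # alignments restart anywhere (the "local" part of the algorithm).
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

corpus = ["av. de mayo", "av. del libertador", "av. diaz velez"]
text = "of. de compras hosp. c. durand (diaz velez 5044) c.a.b.a"
scores = {loc: smith_waterman(text, loc) for loc in corpus}
print(max(scores, key=scores.get))  # pick the highest-scoring corpus entry
```

Existing implementations (e.g. Biopython's pairwise aligners) are considerably faster than this pure-Python grid; it's only meant to show the scoring mechanics.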

Upvotes: 0
