AdeDoyle

Reputation: 395

What is the best way to compare strings to find matching words in Python?

I have two texts, text A and text B. Text B isn't an exact copy of text A, it has a lot of special characters which aren't in text A, but it is technically the same text. I need to compare the strings and map the counterparts in Text B to their counterparts in Text A.

The text isn't in English, and can't easily be translated to English, so the examples below are just to demonstrate a few of the problems.

Some words in text A are not in text B, but all words in text B should be in Text A:

text_a = "he experienced déjà vu"
text_b = ['he', 'experienced']

Some words in text B use different characters to text A, but are the same words:

text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'café', 'floræ']

Words in text B are generally in the right order, but not always:

text_a = "an uneasy alliance"
text_b = ['uneasy', 'alliance', 'an']

Some words in text B are made up of smaller components, which are also included in text B, and these smaller components are unnecessary:

text_a = "we should withdraw our claim"
text_b = ['we', 'should', 'with', 'draw', 'withdraw', 'our', 'claim']

Some words in text A are represented by two or more words in text B:

text_a = "they undercut their competitors"
text_b = ['they', 'under', 'cut', 'their', 'competitors']

What I want to do is replace the words in text A with their counterparts from text B. To do this I need to write a function to match words between the two texts.

I've tried writing a function that compares strings using the edit distance method from the nltk library, in conjunction with a handful of regexes. This only does an OK job, so I looked at using sequence alignment techniques from libraries like biopython, but I can't get my head around them.

In particular, while using edit distance it's very difficult to match words like 'under' and 'cut' to 'undercut' while also avoiding errors on short strings. In a sentence containing similar tokens, like 'to' and 'tu', both tokens have the same edit distance from something like 'tú' and are theoretically equally valid candidates, though the obvious match here would be 'tu', not 'to'.
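For example, the ambiguity can be reproduced with a plain Levenshtein distance (a minimal pure-Python sketch, equivalent in spirit to nltk's edit_distance; the function below is my own illustration, not library code):

```python
# Pure-Python Levenshtein distance, used here only to reproduce
# the short-string ambiguity described above.
def edit_distance(a, b):
    # Classic dynamic-programming table: dp[i][j] is the distance
    # between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1]

# Both candidates are one edit away from 'tú', so distance alone
# cannot prefer the intuitively correct match 'tu'.
print(edit_distance("to", "tú"))  # 1
print(edit_distance("tu", "tú"))  # 1
```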

Is there any highly accurate way to match strings from text B in text A? I'd like to get an output like:

text_a = "the cafe florae undercut their competitors then withdrew their claim"
text_b = ['the', 'café', 'floræ', 'under', 'cut', 'their', 'competitors', 'then',
          'with', 'drew', 'withdrew', 'their', 'claim']

match_list = some_matchfunc(text_a, text_b)

print(match_list)

[['the', 'the'], ['cafe', 'café'], ['florae', 'floræ'], ['undercut', 'under'],
 ['undercut', 'cut'], ['their', 'their'], ['competitors', 'competitors'], ['then', 'then'],
 ['withdrew', 'withdrew'], ['their', 'their'], ['claim', 'claim']]

Ideally, this would also include the index for the beginning and end of each matched word in text A, to avoid confusion, like with the word "their" which occurs twice below:

[['the', [0, 3], 'the'], ['cafe', [4, 8], 'café'], ['florae', [9, 15], 'floræ'],
 ['undercut', [16, 24], 'under'], ['undercut', [16, 24], 'cut'], ['their', [25, 30], 'their'],
 ['competitors', [31, 42], 'competitors'], ['then', [43, 47], 'then'], ['withdrew', [48, 56], 'withdrew'],
 ['their', [57, 62], 'their'], ['claim', [63, 68], 'claim']]

As mentioned above, the text isn't in English, and translating it to compare words using NLP techniques isn't really feasible, so it does need to be based on string comparison. I think there must be some method or library already in existence which employs a more efficient sequence alignment algorithm than I can come up with using RegEx and edit distance, but I can't find one.

Does anybody know of a highly accurate method for comparing strings to achieve this outcome?

Upvotes: 0

Views: 1394

Answers (1)

Mateo Torres

Reputation: 1625

The problem itself is very complex. I would suggest a combination of dictionaries of suitable synonyms where applicable, falling back to a sequence alignment approach. The implementations in biopython are probably not suitable for this case (BLAST, for instance, relies on a score matrix that makes sense only for nucleotide or amino acid sequences, not for real words). I suggest you have a look at difflib.SequenceMatcher from the standard library, which could do the job. A very simple (albeit naive) solution is to do a pairwise alignment of each word against all the candidates and pick the closest match. How well this works will depend on the complexity of the alignment, e.g. whether gaps/replacements are required (imagine "they're" -> "they are").
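As a sketch of the naive pairwise approach (the helper name is mine, not a library function):

```python
from difflib import SequenceMatcher

def best_match(word, candidates):
    """Return the candidate with the highest similarity ratio
    (hypothetical helper; picks a single closest match)."""
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())

print(best_match("café", ["cafe", "then", "claim"]))  # 'cafe'

# Joining adjacent tokens handles the one-to-many case:
print(SequenceMatcher(None, "under" + "cut", "undercut").ratio())  # 1.0
```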

Bear in mind that in some cases many-to-many, one-to-many, and many-to-one substitutions will be required (you have some of these in your example already). These are not automatically solved by the sequence alignment, hence my suggestion to use a dictionary (a bidirectional one, if you can afford it). If the synonym corpus is very big, I would even consider a database for such tasks.
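The dictionary fallback could look something like this (the mapping entries here are illustrative; real entries would come from your own corpus):

```python
from difflib import SequenceMatcher

# Hypothetical substitution dictionary; populate it from your corpus.
SYNONYMS = {"and": "&", "they are": "they're"}

def lookup_or_align(word, candidates):
    """Try the exact dictionary first, then fall back to fuzzy alignment."""
    if word in SYNONYMS and SYNONYMS[word] in candidates:
        return SYNONYMS[word]
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())

print(lookup_or_align("and", ["she", "owns", "&", "runs"]))  # '&'
```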

Also, some of your examples require word-level substitutions and some require letter-level substitutions. I suggest that you handle these separately. If you don't have to deal with typos, I would start with the bigger (word) scale, and then move on to the letter-level substitutions.
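For the letter-level substitutions, Unicode normalisation plus a small manual table may get you a long way (a sketch; the ligature table below is illustrative and would need extending for your alphabet):

```python
import unicodedata

# Manual table for ligatures that NFKD does not decompose;
# these entries are illustrative, not exhaustive.
LIGATURES = {"æ": "ae", "œ": "oe"}

def normalise(word):
    """Strip diacritics and expand ligatures, e.g. 'floræ' -> 'florae'."""
    for lig, plain in LIGATURES.items():
        word = word.replace(lig, plain)
    # NFKD splits characters like 'é' into 'e' + a combining accent,
    # which the filter below removes.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalise("café"))   # 'cafe'
print(normalise("floræ"))  # 'florae'
```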

Upvotes: 1
