Reputation: 362
Beginner here:
I have a block of text:
For example: 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
and a list of words: ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
My end goal is to find which strings from the list match, or fuzzy match, part of the block of text.
What I have tried: difflib.get_close_matches
Required output: 'angiotensin enzyme serum', 'angiotensin enzyme a1'
Output order isn't a concern.
For other blocks of text, some other string from the list would match. The block isn't constant.
Is there a way to achieve this?
Upvotes: 0
Views: 359
Reputation: 13079
Using fuzzywuzzy (from PyPI):
from fuzzywuzzy import fuzz
text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
# partial_ratio scores the shorter string against the best-matching part of the longer one (0-100)
matches = [w for w in words if fuzz.partial_ratio(text, w) > 70]
Obviously you will want to adjust the threshold value to suit, but the values are well separated in this example:
>>> print(matches)
['angiotensin enzyme serum', 'angiotensin enzyme a1']
>>> for w in words:
...     print(w, fuzz.partial_ratio(text, w))
...
angiotensin enzyme serum 83
some diff enzyme 56
angiotensin enzyme a1 90
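If installing a third-party package is not an option, roughly the same idea can be sketched with only the standard library. The helper below is hypothetical (the name partial_ratio is borrowed for readability, it is not fuzzywuzzy's implementation): it slides a window the length of the phrase across the text and keeps the best difflib.SequenceMatcher ratio. The threshold of 70 is the same assumption as above and will likely need tuning for other texts.

```python
from difflib import SequenceMatcher


def partial_ratio(phrase, text):
    """Best similarity (0-100) between phrase and any equally long slice of text.

    Hypothetical stdlib-only helper, not fuzzywuzzy's own algorithm.
    """
    # Always slide the shorter string over the longer one.
    if len(phrase) > len(text):
        phrase, text = text, phrase
    n = len(phrase)
    best = 0.0
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        best = max(best, SequenceMatcher(None, phrase, window).ratio())
    return round(100 * best)


text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

# Keep only the phrases whose best window clears the (assumed) threshold.
matches = [w for w in words if partial_ratio(w, text) > 70]
print(matches)
```

This checks every window, so it is slower than fuzzywuzzy on long texts, but for a block of this size the difference should not matter.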
Upvotes: 2