Reputation: 933
I have a text that I have tokenized (a plain list of words would work as well). For example:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
If I have a Python dict that contains both single-word and multi-word keys, how can I efficiently and correctly check for their presence in the text? The ideal output would be key:location_in_text pairs, or something similarly convenient. Thanks in advance!
P.S. To explain "correctly": if I have "lease" in my dict, I do not want "Please" to be marked. Recognizing plurals is also required. I am wondering whether this can be solved elegantly, without many if-else clauses.
Upvotes: 3
Views: 3212
Reputation: 122280
If you already have a gazetteer (a list) of multi-word expressions, you can use NLTK's MWETokenizer, e.g.:
>>> from nltk.tokenize import MWETokenizer
>>> from nltk import sent_tokenize, word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
>>> [mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
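MWETokenizer merges the expressions but does not report where they occur, and it does nothing about plurals. If you need key:location pairs, a token-level lookup along the following lines might work. This is only a minimal sketch: `normalize` and `find_keys` are hypothetical helper names, not part of NLTK, and the plural handling is a deliberately naive "strip a trailing s" rule (a lemmatizer such as NLTK's WordNetLemmatizer could replace it). Because whole tokens are compared, "lease" never matches "Please".

```python
from collections import defaultdict


def normalize(token):
    """Lowercase and strip a trailing plural 's' (naive assumption, not a real lemmatizer)."""
    t = token.lower()
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t


def find_keys(tokens, keys):
    """Return {key: [start indices]} for every key that occurs in the token list."""
    # Group keys by their normalized first word so each token needs only one lookup.
    by_first = defaultdict(list)
    for key in keys:
        parts = tuple(normalize(w) for w in key.split())
        if parts:  # skip empty keys
            by_first[parts[0]].append((key, parts))

    norm = [normalize(t) for t in tokens]
    hits = defaultdict(list)
    for i, tok in enumerate(norm):
        for key, parts in by_first.get(tok, ()):
            # Compare whole tokens, so 'lease' cannot match inside 'please'.
            if tuple(norm[i:i + len(parts)]) == parts:
                hits[key].append(i)
    return dict(hits)


tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
          'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
keys = {'muffin': 1, 'New York': 2, 'lease': 3}  # example dict; values are arbitrary

print(find_keys(tokens, keys))
# {'muffin': [1], 'New York': [6]}
```

The positions are indices into the token list. Indexing the keys by first word keeps the scan close to linear in the number of tokens, as long as few keys share the same first word.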
Upvotes: 5