Reputation: 285
I have the string 'homemade green tea powder'
and a dictionary dict = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}
How can I map parts of the string to the dictionary keys and get the corresponding values? For example, 'green tea' appears in the string and is one of the dictionary keys, and the same is true of 'homemade'. I want a result like this:
[('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
I'm thinking about taking adjacent words into account. Can I do n-gram mapping? If I look at runs of three, two, and one words in the string, I get: homemade green tea, green tea powder, homemade green, green tea, tea powder, homemade, green, tea, powder. Then I can check whether each of those n-grams is one of the dictionary keys.
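The n-gram enumeration described above could be sketched roughly as follows. This is only an illustration of the idea, not a tested solution: the helper names `ngrams` and `candidate_terms` and the variable `tags` are made up for this example, and the text is assumed to be split on whitespace.

```python
def ngrams(words, n):
    """Return every run of n adjacent words, joined by single spaces."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def candidate_terms(text, max_n=3):
    """List the n-grams of text, from max_n words down to single words."""
    words = text.split()
    terms = []
    for n in range(max_n, 0, -1):
        terms.extend(ngrams(words, n))
    return terms


# Hypothetical lookup table mirroring the dictionary in the question.
tags = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}

terms = candidate_terms('homemade green tea powder')
# Keep only the n-grams that are dictionary keys.
matches = [(term, tags[term]) for term in terms if term in tags]
print(terms)
print(matches)
```

Note that the matches come back ordered by n-gram length rather than by position in the string, and overlapping n-grams are not resolved, so this step alone does not produce the desired result.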
My current code:
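from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag, map_tag

def get_pos_tup(string):
    lst = []
    for word in string.split():
        if word in dict:
            # Single-word dictionary hit: use the custom label.
            lst.append((word, dict[word]))
        else:
            # Otherwise fall back to NLTK's tagger, mapped to the universal tagset.
            for token, tag in pos_tag(word_tokenize(word)):
                lst.append((token, map_tag('en-ptb', 'universal', tag)))
    return lst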
My result is: [('homemade', 'NOUN'), ('green', 'ADJ'), ('tea', 'NOUN'), ('powder', 'NOUN')]
Upvotes: 1
Views: 75
Reputation: 106553
You can join the keys of the dict into an alternation regex pattern, then use re.findall to find all the matching keywords and map them to their values in a list comprehension:
import re
d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM', 'powder': 'NOUN'}
s = 'homemade green tea powder'
print([(k, d[k]) for k in re.findall(r'\b(?:%s)\b' % '|'.join(map(re.escape, d)), s)])
This outputs:
[('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
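If some keywords may be sub-sequences of other keywords (for example 'green' and 'green tea'), sort the keywords by number of words in descending order first. Regex alternation tries the alternatives from left to right and takes the first one that matches, so the longer keywords have to come first: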
import re
d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM', 'powder': 'NOUN', 'green': 'COLOR'}
s = 'green homemade green tea powder'
print([(k, d[k]) for k in re.findall(r'\b(?:%s)\b' % '|'.join(map(re.escape, sorted(d, key=lambda w: -w.count(' ')))), s)])
This outputs:
[('green', 'COLOR'), ('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
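The examples above add 'powder' to the dict by hand. If words that are not keywords should instead be tagged some other way (the question uses NLTK's pos_tag for this), one possible sketch is to append a catch-all `\w+` alternative to the pattern and route non-keyword matches to a fallback. The function name `tag_phrases` and the `fallback_tag` parameter are hypothetical, and the default fallback just returns a placeholder label. It assumes `tags` is non-empty.

```python
import re


def tag_phrases(text, tags, fallback_tag=lambda word: 'UNKNOWN'):
    """Tag dictionary keywords in text; pass any other word to fallback_tag."""
    # Longest keywords first so that 'green tea' is tried before 'green'.
    keywords = sorted(tags, key=lambda k: -k.count(' '))
    # The trailing \w+ alternative picks up words that are not keywords.
    pattern = r'\b(?:%s)\b|\w+' % '|'.join(map(re.escape, keywords))
    return [(m, tags[m]) if m in tags else (m, fallback_tag(m))
            for m in re.findall(pattern, text)]


d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}
result = tag_phrases('homemade green tea powder', d)
print(result)
# [('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'UNKNOWN')]
```

To get part-of-speech labels for the leftover words, `fallback_tag` could be swapped for a small wrapper around a tagger of your choice.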
Upvotes: 1