Anita

Reputation: 285

part of string matches dictionary key string

I have a string, 'homemade green tea powder', and a dictionary: dict = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}

How can I map parts of the string to the dictionary keys and get the corresponding values? For example, "green tea" appears in the string and is one of the keys in the dictionary. The same goes for "homemade". I want a result like this:

[('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]

I'm thinking about taking adjacent words into account. Can I do n-gram mapping? If I look at sequences of three, two, and one words in the string, I get: homemade green tea, green tea powder, homemade green, green tea, tea powder, homemade, green, tea, powder. I could then check whether each of those n-grams is one of the dictionary keys.

My current code:

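from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag, map_tag

def get_pos_tup(string):
    lst = []
    for word in string.split():
        if word in dict:
            # Single-word match against the dictionary keys
            lst.append((word, dict[word]))
        else:
            # Otherwise fall back to NLTK's POS tagger (universal tagset)
            for token, tag in pos_tag(word_tokenize(word)):
                lst.append((token, map_tag('en-ptb', 'universal', tag)))
    return lst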

My result is: [('homemade', 'NOUN'), ('green', 'ADJ'), ('tea', 'NOUN'), ('powder', 'NOUN')]

Upvotes: 1

Views: 75

Answers (1)

blhsing

Reputation: 106553

You can join the keys of the dict into a regex alternation pattern, use re.findall to find all the matching keywords, and map each match to its value in a list comprehension:

import re
d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM', 'powder': 'NOUN'}
s = 'homemade green tea powder'
print([(k, d[k]) for k in re.findall(r'\b(?:%s)\b' % '|'.join(map(re.escape, d)), s)])

This outputs:

[('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
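Note that the dict above has 'powder': 'NOUN' added by hand. If the leftover words should be tagged some other way instead, as in the question, the pattern can be given a second alternative that catches any other word. The following is only a minimal sketch: tag_phrases and its fallback parameter are hypothetical names, and the lambda that always returns 'NOUN' is a placeholder for a real tagger (for example a wrapper around pos_tag and map_tag from the question).

```python
import re

def tag_phrases(text, mapping, fallback=None):
    """Tag dictionary keywords in text; hand every other word to fallback.

    tag_phrases and fallback are illustrative names, not part of any library.
    """
    # Longest keywords first, so a multi-word key is tried before shorter ones.
    keys = sorted(mapping, key=len, reverse=True)
    if keys:
        keyword = r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b'
        # First alternative: a dictionary keyword. Second: any other word.
        pattern = re.compile(keyword + r'|\w+')
    else:
        pattern = re.compile(r'\w+')

    result = []
    for match in pattern.finditer(text):
        token = match.group()
        if token in mapping:
            result.append((token, mapping[token]))
        elif fallback is not None:
            result.append((token, fallback(token)))
    return result

d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}
s = 'homemade green tea powder'
# Placeholder fallback: tags every unmatched word as 'NOUN'.
print(tag_phrases(s, d, fallback=lambda word: 'NOUN'))
# [('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
```

Matching here is case-sensitive and assumes the keys are plain words or space-separated phrases; passing flags=re.IGNORECASE would need the lookup into the dict to be normalised as well.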

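If some keywords can be sub-sequences of other keywords (for example 'green' and 'green tea'), sort the keywords by the number of words in descending order first. Regex alternation tries the alternatives from left to right and takes the first one that matches, so the longer keywords have to come first: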

import re
d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM', 'powder': 'NOUN', 'green': 'COLOR'}
s = 'green homemade green tea powder'
print([(k, d[k]) for k in re.findall(r'\b(?:%s)\b' % '|'.join(map(re.escape, sorted(d, key=lambda w: -w.count(' ')))), s)])

This outputs:

[('green', 'COLOR'), ('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
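The n-gram lookup proposed in the question also works without a regex. The sketch below is one possible way to do it, not the only one: at each word it tries the longest n-gram first and falls back to shorter ones. The names ngram_tag, max_n and default are hypothetical, and the 'NOUN' default is a placeholder for whatever tagger should handle unmatched words.

```python
def ngram_tag(text, mapping, max_n=3, default='NOUN'):
    """Greedy longest-match lookup of word n-grams in mapping.

    Illustrative helper; max_n is the longest n-gram (in words) to try.
    """
    words = text.split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest n-gram starting at word i, then shorter ones.
        for n in range(min(max_n, len(words) - i), 0, -1):
            phrase = ' '.join(words[i:i + n])
            if phrase in mapping:
                result.append((phrase, mapping[phrase]))
                i += n  # skip the words consumed by this n-gram
                break
        else:
            # No n-gram starting here is a key: emit the single word
            # with the placeholder tag.
            result.append((words[i], default))
            i += 1
    return result

d = {'green tea': 'FLAVOR', 'banana': 'FLAVOR', 'homemade': 'CLAIM'}
print(ngram_tag('homemade green tea powder', d))
# [('homemade', 'CLAIM'), ('green tea', 'FLAVOR'), ('powder', 'NOUN')]
```

Unlike the regex version, this splits on whitespace only, so punctuation attached to a word (such as "tea,") would prevent a match unless the text is tokenized first.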

Upvotes: 1
