Reputation: 497
I have a list of tuples that are generated from a string using NLTK's PoS tagger.
I'm trying to find the "intent" of a specific string in order to append it to a dataframe, so I need a way to generate a syntax/grammar rule.
string = "RED WHITE AND BLUE"
string_list = nltk.pos_tag(string.split())
string_list = [('RED', 'JJ'), ('WHITE', 'NNP'), ('AND', 'NNP'), ('BLUE', 'NNP')]
The strings vary in size, from 2-3 elements all the way to full-on paragraphs (40-50+ words), so I'm wondering if there is a general form or rule that I can create to parse a sentence.
So if I want to find a pattern in a list, example pseudocode output would be:
string_pattern = "I want to kill all the bad guys in the Halo Game"
pattern = ('I', 'PRP') + ('want', 'VBP') + ('to', 'TO') + ('kill', 'JJ') + ('all', 'DT') + ('bad', 'JJ') + ('guys', 'NNS') + ('in', 'IN') + ('Halo', 'NN') + ('Game', 'NN')
Ideally I would be able to match part of the pattern in a tagged string, so it finds:
('I', 'PRP') + ('want', 'VBP') + ('to', 'TO') + ('kill', 'JJ')
but it doesn't need the rest; or, vice versa, it could find multiple instances of the pattern in the same string, in the event that the string is a paragraph. If anyone knows the best way to do this, or a better alternative, it would be really helpful!
Upvotes: 0
Views: 91
Reputation: 2079
The simplest method I can think of is brute force (though you could adapt it, or even use some machine learning to find classes that make matching easier).
A simple brute-force method follows:
Tag the String
string_list = nltk.pos_tag(string.split())
Create a list of expected tags
pos_tags = ["NN", "VBP", "NN"]
The following function checks whether this pattern appears:
def find_match(string_list, pos_tags):
    num_matched = 0
    match_start_pos = 0
    matched = False
    # Enumerating gives you an index so you can record where the match starts
    for idx, (word, tag) in enumerate(string_list):
        if tag == pos_tags[num_matched]:
            num_matched += 1
            if num_matched == 1:  # first token of a candidate match
                match_start_pos = idx
        else:
            num_matched = 0
        if num_matched == len(pos_tags):
            matched = True
            break
    return (matched, match_start_pos)
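As a quick self-contained check, here is a working version of that brute-force matcher run on a hand-tagged sentence (the tags are hardcoded here so the sketch runs without NLTK; the start-position bookkeeping records the index of the first matched token):

```python
# Brute-force PoS-sequence matcher: walk the tagged tuples and advance a
# counter while consecutive tags match the expected sequence.
def find_match(string_list, pos_tags):
    num_matched = 0
    match_start_pos = 0
    matched = False
    for idx, (word, tag) in enumerate(string_list):
        if tag == pos_tags[num_matched]:
            num_matched += 1
            if num_matched == 1:  # first token of a candidate match
                match_start_pos = idx
        else:
            num_matched = 0
        if num_matched == len(pos_tags):
            matched = True
            break
    return (matched, match_start_pos)

# Hand-tagged sample (what nltk.pos_tag would roughly produce)
tagged = [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('kill', 'VB'),
          ('all', 'DT'), ('the', 'DT'), ('bad', 'JJ'), ('guys', 'NNS')]
print(find_match(tagged, ['VBP', 'TO', 'VB']))  # (True, 1)
```

Note this naive scan resets to zero on any mismatch, so overlapping candidates (where the mismatching token itself starts a new match) can be skipped; that is usually acceptable for short intent patterns.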
More Realistically:
Now, more practically, suppose you work for a civilian protection agency and want to be alerted to any tweet made by school students mentioning killing. You somehow filter the tweets and want to check whether someone wants to kill anyone else.
With just a little modification, you can achieve something similar (the following ideas are loosely inspired by what is called Frame Semantics):
killing_intent_dict = {"PRP": {"I", "You", "He", "She"}, "VB": {"kill"}, "NNP": {"All", "him", "her"}}
matched, match_start_pos = find_match_pattern(string_list, killing_intent_dict)
if matched:
    # someone wants to kill! Call 911
def find_match_pattern(string_list, pattern_dict):
    pattern_tags = list(pattern_dict)  # dicts keep insertion order in Python 3.7+
    num_matched = 0
    match_start_pos = 0
    matched = False
    # Enumerating gives you an index so you can record where the match starts
    for idx, (word, tag) in enumerate(string_list):
        if tag == pattern_tags[num_matched] and word in pattern_dict[tag]:
            num_matched += 1
            if num_matched == 1:  # first token of a candidate match
                match_start_pos = idx
        else:
            num_matched = 0
        if num_matched == len(pattern_tags):
            matched = True
            break
    return (matched, match_start_pos)
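Here is the dictionary-driven variant as a runnable sketch. Two fixes are worth noting: the pattern sets must be set literals (set("I", "You") is a TypeError, since set() takes a single iterable), and in Python 3 dict.keys() is not subscriptable, so the tags are materialized with list() first. The trigger words and tags below are illustrative assumptions, not a real lexicon:

```python
# Dictionary-driven matcher: each expected tag maps to a set of words
# that count as a match for that position in the pattern.
def find_match_pattern(string_list, pattern_dict):
    pattern_tags = list(pattern_dict)  # insertion order, Python 3.7+
    num_matched = 0
    match_start_pos = 0
    matched = False
    for idx, (word, tag) in enumerate(string_list):
        if tag == pattern_tags[num_matched] and word in pattern_dict[tag]:
            num_matched += 1
            if num_matched == 1:  # first token of a candidate match
                match_start_pos = idx
        else:
            num_matched = 0
        if num_matched == len(pattern_tags):
            matched = True
            break
    return (matched, match_start_pos)

# Hypothetical intent lexicon and a hand-tagged input
killing_intent_dict = {"PRP": {"I", "You", "He", "She"},
                       "VBP": {"want"},
                       "VB": {"kill"}}
tagged = [('I', 'PRP'), ('want', 'VBP'), ('kill', 'VB')]
matched, start = find_match_pattern(tagged, killing_intent_dict)
print(matched, start)  # True 0
```

Note the function returns a tuple, which is always truthy, so unpack the result and test the matched flag rather than the tuple itself.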
Keep in mind that this is all experimental and requires a lot of hand coding. You can add NER tags to it so you can abstract away names.
Adding another possibility, similar to the one I used in my master's research:
Instead of a linear brute-force mechanism, you could create a graph containing the actions, agents and intents, connecting them all, and then use some sort of graph-spreading algorithm while your program reads the input. You can read more in my research, but keep in mind that the topic you are asking about (Natural Language Understanding) is deep and still under development: https://drive.google.com/open?id=12gWLx2saFe5mZI96roUG_p1YfzrqVNbx
Upvotes: 1