Reputation: 981
main_text is a list of lists containing sentences that've been part-of-speech tagged:
main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes','VB'),
('tea','NN'), ('and','CC'), ('hats', 'NN')], [('the', 'DT'), ('red','JJ'),
('queen', 'NN'), ('hates','VB'),('alice','NN')]]
ngrams_to_match is a list of lists containing part-of-speech tagged trigrams:
ngrams_to_match = [[('likes','VB'),('tea','NN'), ('and','CC')],
[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
[('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC') ],
[('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]
(a) For each sentence in main_text, first check to see if a complete trigram in ngrams_to_match matches. If the trigram matches, return the matched trigram and the sentence.
(b) Then, check to see if the first tuple (a unigram) or the first two tuples (a bigram) of each of the trigrams match in main_text.
(c) If the unigram or bigram forms a substring of an already matched trigram, don't return anything. Otherwise, return the bigram or unigram match and the sentence.
Here is what the output should be:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('hates', 'DT'), ('alice', 'JJ')], sentence[1]
Condition (b) gives us the bigram_match.
The WRONG output would be:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
bigram_match = [('the', 'DT'), ('mad', 'JJ')] #*bad by condition c
unigram_match = [('the', 'DT')] #*bad by condition c
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('likes','VB'),('tea','NN')] #*bad by condition c
unigram_match = [('likes', 'VB')] #*bad by condition c
and so on.
The following, very ugly code works okay for this toy example. But I was wondering if anyone had a more streamlined approach.
matched_words = set()  # words covered by the trigram matches found so far
for ngram in ngrams_to_match:
    for sentence in main_text:
        for unigram_index, tup in enumerate(sentence):
            # we can't be sure that our part-of-speech tagger will
            # tag an ngram word and a main_text word the same way, so
            # we match the word in the tuple, not the whole tuple
            if ngram[0][0] != tup[0]:  # the first ngram word has to match
                continue
            unigram = tup[0]  # save it as a unigram
            bigram = trigram = None
            try:
                if sentence[unigram_index + 1][0] == ngram[1][0]:  # get bigram match
                    bigram = (unigram, ngram[1][0])  # save the match
                    if sentence[unigram_index + 2][0] == ngram[2][0]:  # match a trigram
                        trigram = bigram + (ngram[2][0],)  # save the match
            except IndexError:
                pass  # ran past the end of the sentence
            if trigram:
                matched_words.update(trigram)
                print('heres the trigram--> %r\ntrigram---> %r' % (sentence, trigram))
            elif bigram:
                if not matched_words.issuperset(bigram):  # no substring matches
                    print('heres a sentence--> %r\nbigram---> %r' % (sentence, bigram))
            elif unigram not in matched_words:  # no substring matches
                print(unigram)
Upvotes: 4
Views: 6733
Reputation: 38247
I've had a stab at implementing this using a generator. I found some gaps in your spec, so I've made assumptions.
"If the unigram or bigram forms a substring of an already matched trigram, don't return anything." is a bit ambiguous about whether each "gram" refers to the search elements or to the matched elements. It makes me start to hate the N-gram words (which I'd never heard of before last week).
Play with what gets added to the found set to change which search elements are excluded.
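For anyone who hasn't used the send() side of generators, here is a minimal sketch of the pattern on its own, separate from the n-gram code. The names windows and claim_matching are made up for this illustration. The caller tells the generator whether the window it just received was used, and the generator then blanks that window so later windows can't overlap it.

```python
def windows(items, size):
    """Yield each full window of `size` items; send(True) blanks the one just yielded."""
    items = list(items)
    for i in range(len(items) - size + 1):
        window = tuple(items[i:i + size])
        if None in window:
            continue  # overlaps a window that was already claimed
        claimed = yield window  # value passed to send() by the caller
        if claimed:
            items[i:i + size] = [None] * size


def claim_matching(items, size, wanted):
    """Drive the generator by hand, claiming every window found in `wanted`."""
    gen = windows(items, size)
    claimed = []
    feedback = None  # the first send() to a fresh generator has to be None
    try:
        while True:
            window = gen.send(feedback)
            feedback = window in wanted
            if feedback:
                claimed.append(window)
    except StopIteration:
        pass  # the generator ran out of windows
    return claimed


print(claim_matching("abcabc", 2, {("a", "b"), ("b", "c")}))
```

Both ("a", "b") windows get claimed here, and ("b", "c") never does, because each of its occurrences overlaps a window that was already blanked.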
# assumptions:
# - [('hates','DT'),('alice','JJ'),('but','CC')] is typoed and should be:
# [('hates','VB'),('alice','NN'),('but','CC')]
# - matches can't overlap, matched elements are excluded from further checking
# - bigrams precede unigrams
main_text = [
[('the','DT'),('mad','JJ'),('hatter','NN'),('likes','VB'),('tea','NN'),('and','CC'),('hats','NN')],
[('the','DT'),('red','JJ'),('queen','NN'),('hates','VB'),('alice','NN')]
]
ngrams_to_match = [
[('likes','VB'),('tea','NN'),('and','CC')],
[('the','DT'),('mad','JJ'),('hatter','NN')],
[('hates','VB'),('alice','NN'),('but','CC')],
[('and','CC'),('the','DT'),('rabbit','NN')]
]
def slice_generator(sentence, size=3):
    """
    Generate slices through the sentence in decreasing sized windows. If True is sent to the
    generator, the elements from the previous window will be excluded from future slices.
    """
    sent = list(sentence)
    for c in range(size, 0, -1):
        for i in range(len(sent)):
            window = tuple(sent[i:i+c])  # named window so the builtin slice isn't shadowed
            if all(x is not None for x in window) and len(window) == c:
                used = yield window
                if used:
                    # blank out exactly the c elements that were just matched
                    sent[i:i+c] = [None] * c

def gram_search(text, matches):
    # every trigram plus its leading bigram and leading unigram
    tri_bi_uni = set(tuple(x) for x in matches) | set(tuple(x[:2]) for x in matches) | set(tuple(x[:1]) for x in matches)
    found = set()
    for i, sentence in enumerate(text):
        gen = slice_generator(sentence)
        send = None  # the first send() to a new generator has to be None
        try:
            while True:
                row = gen.send(send)
                if row in tri_bi_uni - found:
                    send = True
                    # exclude the shorter prefixes of this match from later searches
                    found |= set(tuple(row[:x]) for x in range(1, len(row)))
                    print("%s_gram_match, sentence[%s] = %r" % (len(row), i, row))
                else:
                    send = False
        except StopIteration:
            pass
gram_search(main_text,ngrams_to_match)
Yields:
3_gram_match, sentence[0] = (('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'))
3_gram_match, sentence[0] = (('likes', 'VB'), ('tea', 'NN'), ('and', 'CC'))
2_gram_match, sentence[1] = (('hates', 'VB'), ('alice', 'NN'))
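If the tagger can't be trusted to tag the same word the same way in both lists, as the comment in the question's code suggests, one alternative is to compare words only and ignore the tags. The sketch below is just one reading of conditions (a) to (c): it searches all trigrams first, then bigrams, then unigrams, and skips any shorter candidate that is a contiguous run inside a longer match found earlier. The helper names words and find_matches are invented for this example.

```python
main_text = [
    [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
     ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
    [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'), ('alice', 'NN')],
]
ngrams_to_match = [
    [('likes', 'VB'), ('tea', 'NN'), ('and', 'CC')],
    [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
    [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')],  # tags as given in the question
    [('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')],
]


def words(tagged):
    """Drop the tags: [('tea', 'NN'), ('and', 'CC')] -> ('tea', 'and')."""
    return tuple(word for word, _tag in tagged)


def find_matches(text, ngrams, max_size=3):
    """Return (size, sentence_index, words) tuples, longest n-grams first."""
    sentences = [words(sentence) for sentence in text]
    matches = []
    covered = set()  # shorter word runs contained in a match found so far
    for size in range(max_size, 0, -1):
        # the leading `size` words of every n-gram we were asked to find
        targets = {words(ngram[:size]) for ngram in ngrams if len(ngram) >= size}
        for index, sentence in enumerate(sentences):
            for start in range(len(sentence) - size + 1):
                window = sentence[start:start + size]
                if window in targets and window not in covered:
                    matches.append((size, index, window))
                    # remember every strictly shorter run inside this match
                    for n in range(1, size):
                        for k in range(size - n + 1):
                            covered.add(window[k:k + n])
    return matches


for size, index, window in find_matches(main_text, ngrams_to_match):
    print("%d-gram match in sentence[%d]: %s" % (size, index, " ".join(window)))
```

On the toy data this reports the two trigrams in sentence[0] and the bigram "hates alice" in sentence[1], even with the question's original tags, because the tags are never compared. Whether a repeated word elsewhere in the text should still be suppressed is a judgment call; change what goes into covered to adjust that.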
Upvotes: 1