modarwish

Reputation: 495

How do I extract patterns from lists of POS tagged words? NLTK

I have a text file that contains multiple lists; each list contains tuples of word/pos-tag pairs, like so:

    [('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('i', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'), ('awesome', 'NN'), ('movie', 'NN')]
    [('reviewtext', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')]
    [('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'), ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]

I need to extract all adjective-conjunction-adjective sequences, i.e. all JJ-CC-JJ patterns (the words only, not the POS tags). For this example, the final output should be a list containing all such patterns:

    ['great and fun', 'sad and unhappy']

I used the following code to tag the text:

    import re
    import nltk

    with open("C:\\Users\\M\\Desktop\\sample dataset.txt") as fileobject:
        for line in fileobject:
            line = line.lower() #lowercase
            line = re.sub(r'[^\w\s]','',line) #remove punctuation
            line = nltk.word_tokenize(line) #tokenize
            line = nltk.pos_tag(line) #POS tag

            fo = open("C:\\Users\\M\\Desktop\\movies1_complete.txt", "a")
            fo.write(str(line))
            fo.write("\n")
            fo.close()
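
Note that each saved line is just the str() of a Python list, so if the tagged lines need to be read back into Python later, ast.literal_eval can rebuild them (a sketch assuming the file keeps exactly the format written above):

    import ast

    tagged_lines = []
    with open("C:\\Users\\M\\Desktop\\movies1_complete.txt") as saved:
        for line in saved:
            # each line is the repr of a list of (word, tag) tuples
            tagged_lines.append(ast.literal_eval(line))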

But how do I extract the words in the above-mentioned patterns? I checked here and here, but they do not explain how to extract specific POS patterns. Thanks in advance.

Upvotes: 4

Views: 4295

Answers (2)

Ale

Reputation: 1345

Even though an answer has already been accepted (and it is a great one), I think you will find this useful. You can use the refo library to run regular-expression-style matching over a stream of objects.

import re
from refo import finditer, Predicate, Plus

class Word(object):
    def __init__(self, token, pos):
        self.token = token
        self.pos = pos

class W(Predicate):
    def __init__(self, token=".*", pos=".*"):
        self.token = re.compile(token + "$")
        self.pos = re.compile(pos + "$")
        super(W, self).__init__(self.match)

    def match(self, word):
        m1 = self.token.match(word.token)
        m2 = self.pos.match(word.pos)
        return m1 and m2


originals = [
    [('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'), 
     ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('i', 'PRP'), 
     ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'), 
     ('awesome', 'NN'), ('movie', 'NN')],
    [('reviewtext', 'IN'), ('it', 'PRP'), 
     ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')],
    [('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'), 
     ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'), 
     ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), 
     ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]]


sentences = [[Word(*x) for x in original] for original in originals]

This is the interesting bit: it looks for a sequence of objects whose pos attributes are JJ, followed by CC, followed by either JJ or NN.

pred = W(pos="JJ") + W(pos="CC") + (W(pos="JJ") | W(pos="NN"))
for k, s in enumerate(sentences):
    for match in finditer(pred, s):
        x, y = match.span()   # the match spans x to y inside the sentence s
        print(originals[k][x:y])

Output:

[('great', 'JJ'), ('and', 'CC'), ('fun', 'NN')]
[('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ')]
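
If you want the plain strings from the question rather than the (word, tag) tuples, one way (reusing the same pred, sentences and originals from above) is to join just the words of each match span:

results = []
for k, s in enumerate(sentences):
    for match in finditer(pred, s):
        x, y = match.span()
        # keep only the words, dropping the POS tags
        results.append(" ".join(word for word, tag in originals[k][x:y]))

print(results)  # ['great and fun', 'sad and unhappy']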

Upvotes: 2

Padraic Cunningham

Reputation: 180532

from itertools import islice

# l is the list of tagged sentences (defined further down in this answer)
for sub in l:
    for a, b, c in zip(islice(sub, 0, None), islice(sub, 1, None), islice(sub, 2, None)):
        if all((a[-1] == "JJ", b[-1] == "CC", c[-1] == "JJ")):
            print("{} {} {}".format(a[0], b[0], c[0]))

This outputs 'sad and unhappy'; it does not include 'great and fun' because 'fun' is tagged NN there, so that trigram does not match the JJ-CC-JJ pattern.
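
If you also want to catch 'great and fun' (where 'fun' is tagged NN in the sample), one option, sketched on the assumption that a noun is acceptable in the final slot, is to widen the last check to JJ or NN:

for sub in l:
    for a, b, c in zip(sub, sub[1:], sub[2:]):
        # accept either an adjective or a noun in the final position
        if a[-1] == "JJ" and b[-1] == "CC" and c[-1] in ("JJ", "NN"):
            print("{} {} {}".format(a[0], b[0], c[0]))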

Or just using enumerate and a generator:

l = [[('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('great', 'JJ'), ('and', 'CC'),
      ('fun', 'NN'), ('i', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'), ('awesome', 'NN'),
      ('movie', 'NN')],
     [('reviewtext', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')],
     [('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'), ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]]

def match(l,p1,p2,p3):
    for sub in l:
        # stop early so sub[ind + 1] never runs past the end; the last full trigram is still checked
        end = len(sub) - 1
        for ind, (a, b) in enumerate(sub, 1):
            if ind == end:
                break
            if b == p1 and sub[ind][1] == p2 and sub[ind + 1][1] == p3:
                yield ("{} {} {}".format(a, sub[ind][0], sub[ind + 1][0]))

print(list(match(l,"JJ","CC","JJ")))        

Output (based on example):

['sad and unhappy']
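
Since the three tags are parameters, the same generator can be reused for other patterns; for example, allowing a noun in the last slot (a variation, not required by the question) picks up the other phrase:

print(list(match(l, "JJ", "CC", "NN")))  # ['great and fun']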

Upvotes: 3
