blackbaka

Reputation: 65

How do I specify keywords for NLTK tokenization?

Input: "My favorite game is call of duty."

I want to set "call of duty" as a keyword, so that this phrase is treated as a single token during tokenization.

The result I want is: ['my', 'favorite', 'game', 'is', 'call of duty']

So, how do I set such keywords in Python NLP?

Upvotes: 1

Views: 1878

Answers (2)

Aj Langley

Reputation: 127

This is, of course, way too late to be useful to the OP, but I thought I'd put this answer here for others:

It sounds like what you might be really asking is: How do I make sure that compound phrases like 'call of duty' get grouped together as one token?

You can use nltk's multiword expression tokenizer, like so:

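import nltk

string = 'My favorite game is call of duty'
tokenized_string = nltk.word_tokenize(string)

# Each multi-word expression is a tuple of its tokens
mwe = [('call', 'of', 'duty')]
# separator=' ' joins the merged tokens with a space (the default is '_')
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')
tokenized_string = mwe_tokenizer.tokenize(tokenized_string)

Here mwe stands for multi-word expression. The value of tokenized_string will be ['My', 'favorite', 'game', 'is', 'call of duty']. Without separator=' ', the merged token would be 'call_of_duty'.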
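To get the lowercase result the question asks for, and to handle the trailing period, the same idea can be wrapped in a small helper. This is only a sketch: the function name tokenize_with_phrases is made up for illustration, and it assumes a simple regex split is good enough in place of nltk.word_tokenize (which needs downloaded tokenizer data).

```python
import re

from nltk.tokenize import MWETokenizer


def tokenize_with_phrases(text, phrases):
    """Lowercase the text, split it into words, then merge the given phrases.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Assumption: a plain regex split is acceptable here. It drops
    # punctuation such as the final period.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # MWETokenizer expects each phrase as a tuple of tokens.
    mwes = [tuple(phrase.lower().split()) for phrase in phrases]
    # separator=' ' joins merged tokens with a space (the default is '_').
    tokenizer = MWETokenizer(mwes, separator=" ")
    return tokenizer.tokenize(words)


print(tokenize_with_phrases("My favorite game is call of duty.", ["call of duty"]))
# ['my', 'favorite', 'game', 'is', 'call of duty']
```

More phrases can be passed in the list, or added to an existing tokenizer later with its add_mwe method.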

Upvotes: 1

David Batista

Reputation: 3134

I think what you want is keyphrase extraction. You can do it, for instance, by first tagging each word with its PoS tag and then applying a regular expression over the PoS tags to join interesting words into keyphrases.

import nltk
from nltk import pos_tag
from nltk import tokenize


def extract_phrases(my_tree, phrase):
    """Recursively collect all subtrees of my_tree labelled `phrase`."""
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))

    for child in my_tree:
        # Leaves are (word, tag) tuples; only subtrees can contain phrases
        if isinstance(child, nltk.Tree):
            my_phrases.extend(extract_phrases(child, phrase))

    return my_phrases


def main():
    sentences = ["My favorite game is call of duty"]

    # Chunk grammar: a regular expression over PoS tags
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)

    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            # Print the chunk and its words joined with underscores
            print(phrase, "_".join([leaf[0] for leaf in phrase.leaves()]))


if __name__ == "__main__":
    main()

This will output the following:

Noun phrases:
(NP favorite/JJ game/NN) favorite_game
(NP call/NN) call
(NP duty/NN) duty

But you can play around with

grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"

trying other types of expressions, so that you can get exactly what you want, depending on the words/tags you want to join together.
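As one illustration of such an experiment, the sketch below extends the noun-phrase rule with an optional preposition-plus-noun tail, so that "call of duty" is kept together. The tags are written by hand here, assuming they match what pos_tag returns for this sentence (the output above suggests call/NN and duty/NN; the IN tag for "of" is an assumption), so the snippet does not need any downloaded tagger model.

```python
import nltk

# Hand-written (word, tag) pairs; assumed to match the pos_tag output.
tagged = [
    ("My", "PRP$"), ("favorite", "JJ"), ("game", "NN"), ("is", "VBZ"),
    ("call", "NN"), ("of", "IN"), ("duty", "NN"),
]

# NP: optional determiner, any adjectives, a noun, and optionally
# a preposition followed by another noun ("call of duty").
grammar = "NP: {<DT>?<JJ>*<NN>(<IN><NN>)?}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP chunk found by the parser.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(phrases)
# ['favorite game', 'call of duty']
```

This rule is deliberately narrow; it would also join unrelated noun-preposition-noun sequences, so it needs tuning against real text.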

Also, if you are interested, check out this very good introduction to keyphrase extraction:

https://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/

Upvotes: 3
