ATMA

Reputation: 1468

pandas and nltk: get most common phrases

I'm fairly new to Python and am working with a pandas DataFrame that has a column full of text. I'm trying to take that column and use nltk to find common phrases (three or four words long).

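import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# strip punctuation and lowercase, then tokenize each row
dat["text_clean"] = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()
dat["text_clean2"] = dat["text_clean"].apply(word_tokenize)

finder = BigramCollocationFinder.from_words(dat["text_clean2"])
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))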

The initial commands seem to work fine. However, when I attempt to use BigramCollocationFinder, it throws the following error.

In [437]: finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder

Traceback (most recent call last):

  File "<ipython-input-437-635c3b3afaf4>", line 1, in <module>
    finder = BigramCollocationFinder.from_words(dat["text_clean2"])

  File "/Users/abrahammathew/anaconda/lib/python2.7/site-packages/nltk/collocations.py", line 168, in from_words
    wfd[w1] += 1

TypeError: unhashable type: 'list'

Any idea what this refers to, or is there a workaround?

I get the same error with the following commands.

gg = dat["text_clean2"].tolist()    
finder = BigramCollocationFinder.from_words(gg)
finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))

The following runs without an error, but reports that there are no common phrases.

gg = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()
finder = BigramCollocationFinder.from_words(gg)
# only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))

Upvotes: 1

Views: 1375

Answers (3)

chadac

Reputation: 417

BigramCollocationFinder.from_words is for a single document, i.e. one flat sequence of tokens. You have one token list per row, so you want to use from_documents:

finder = BigramCollocationFinder.from_documents(gg)
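
To make the difference concrete, here is a minimal sketch with made-up data. The sample sentences and the names `docs` and `tokenized` are assumptions for illustration, and `str.split()` stands in for `word_tokenize` so that no tokenizer data needs to be downloaded. In the question, `gg` (the list of token lists from `dat["text_clean2"].tolist()`) would take the place of `tokenized`.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical stand-in for the cleaned "Description" column.
docs = [
    "the new york office opened today",
    "she moved to new york last year",
    "new york is far from here",
]

# One token list per document, like dat["text_clean2"] in the question.
tokenized = [d.split() for d in docs]

# from_words treats every element as a single token. Here each element
# is a list, which cannot be used as a dictionary key.
try:
    BigramCollocationFinder.from_words(tokenized)
except TypeError as err:
    print("from_words on a list of lists fails:", err)

# from_documents accepts a collection of token lists.
finder = BigramCollocationFinder.from_documents(tokenized)
finder.apply_freq_filter(3)  # keep bigrams seen at least 3 times
top = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top)
```

With this sample data, only the bigram ("new", "york") occurs three times, so it is the only one left after the frequency filter.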

Upvotes: 1

Bharath M Shetty

Reputation: 30605

You might have to convert the list of lists into a list of tuples. Hope this works:

dat['text_clean2'] = [tuple(x) for x in dat['text_clean2']]
finder = BigramCollocationFinder.from_words(dat["text_clean2"])

Upvotes: 1

cs95

Reputation: 402333

It would seem your BigramCollocationFinder class wants a list of words, not a list of lists. Try this:

finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))
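
Note that reshaping on its own leaves each row's token list intact, so if a flat list of words is the goal it has to be built explicitly. One way to do that, swapped in here as an alternative and not taken from the line above, is `itertools.chain.from_iterable`. This is only a sketch with made-up data: `tokenized` is a hypothetical stand-in for `dat["text_clean2"]`.

```python
from itertools import chain

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical stand-in for dat["text_clean2"]: one token list per row.
tokenized = [
    ["machine", "learning", "rocks"],
    ["machine", "learning", "is", "fun"],
]

# Flatten the list of lists into a single list of words.
words = list(chain.from_iterable(tokenized))

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # keep bigrams seen at least 2 times
top = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top)
```

One caveat with flattening: the rows are joined end to end, so pairs that straddle two rows (here "rocks" followed by "machine") are counted as bigrams as well.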

Upvotes: 1
