Gaga

Reputation: 101

How to create a bag of words in Python

Here is my DataFrame test after I cleaned and tokenized it:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)
print(test)

Output:

0  congratulations dear friend ... [congratulations, dear, friend]
1  happy anniversary be happy  ... [happy, anniversary, be, happy]
2  make some sandwich          ...          [make, some, sandwich]

I would like to create a bag of words for my data. The first attempt gave me the error AttributeError: 'list' object has no attribute 'lower':

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

BOW = vectorizer.fit_transform(test['tokenize'])  # the AttributeError is raised here
print(BOW.toarray())
print(vectorizer.get_feature_names())

The second attempt raised AttributeError: 'list' object has no attribute 'split':

from collections import Counter
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))  # each cell is a list, which has no .split()
print(test['BOW'])

Can you please assist me with either method, or both? Thanks!

Upvotes: 0

Views: 1010

Answers (2)

slesh

Reputation: 2007

As your output example shows, test['tokenize'] contains a list in each cell; those lists were already produced by splitting the strings on " ". So to get this line working:

test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))

change it into:

test['BOW'] = test['tokenize'].apply(lambda x: Counter(x))
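
A minimal runnable sketch of that fix, using made-up data shaped like the DataFrame in the question:

from collections import Counter
import pandas as pd

# Hypothetical data matching the question's cleaned/tokenized DataFrame
test = pd.DataFrame({'tokenize': [
    ['congratulations', 'dear', 'friend'],
    ['happy', 'anniversary', 'be', 'happy'],
    ['make', 'some', 'sandwich'],
]})

# Each cell is already a list of tokens, so Counter can consume it directly
test['BOW'] = test['tokenize'].apply(Counter)
print(test['BOW'])  # each row now holds a Counter of token frequencies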

Upvotes: 1

Bhaskar

Reputation: 700

vectorizer.fit_transform takes an iterable of str, unicode, or file objects as its parameter, but you have passed it an iterable of lists (of tokenized strings). You can simply pass the original column of strings, test['tweet'], since CountVectorizer does the tokenizing for you.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Fit on the raw tweets; CountVectorizer tokenizes and lowercases them itself
BOW = vectorizer.fit_transform(test['tweet'])
print(BOW.toarray())
print(vectorizer.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0

This should give you your expected output.
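
Alternatively, if you would rather reuse the column you already tokenized with TweetTokenizer, one option (a sketch, not from the original answer) is to give CountVectorizer an identity analyzer so it skips its own tokenization and counts your existing token lists directly:

from sklearn.feature_extraction.text import CountVectorizer

# A callable analyzer receives each cell as-is; returning it unchanged tells
# CountVectorizer to treat the existing token lists as the final tokens
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0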

Upvotes: 1
