Gaga

Reputation: 101

How to create a bag of words in Python

Here is my DataFrame test after I cleaned and tokenized it:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)
print(test)

Output:

0  congratulations dear friend ... [congratulations, dear, friend]
1  happy anniversary be happy  ... [happy, anniversary, be, happy]
2  make some sandwich          ...          [make, some, sandwich]

I would like to create a bag of words for my data. The first attempt gave me the error AttributeError: 'list' object has no attribute 'lower':

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

BOW = vectorizer.fit_transform(test['tokenize'])  # the AttributeError is raised here
print(BOW.toarray())
print(vectorizer.get_feature_names())

The second attempt raised AttributeError: 'list' object has no attribute 'split':

from collections import Counter
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))  # each cell is a list, which has no .split()
print(test['BOW'])

Can you please assist me with either method, or both? Thanks!

Upvotes: 0

Views: 1010

Answers (2)

slesh

Reputation: 2007

As your output example shows, test['tokenize'] contains a list in each cell; those lists were already produced by splitting the strings on " ". So to get this line working:

test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))

change it into:

test['BOW'] = test['tokenize'].apply(lambda x: Counter(x))
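
A minimal runnable sketch of that fix, using made-up data shaped like the DataFrame in the question:

from collections import Counter
import pandas as pd

# Hypothetical data matching the question's cleaned/tokenized DataFrame
test = pd.DataFrame({'tokenize': [
    ['congratulations', 'dear', 'friend'],
    ['happy', 'anniversary', 'be', 'happy'],
    ['make', 'some', 'sandwich'],
]})

# Each cell is already a list of tokens, so Counter can consume it directly
test['BOW'] = test['tokenize'].apply(Counter)
print(test['BOW'])  # each row now holds a Counter of token frequencies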

Upvotes: 1

Bhaskar

Reputation: 700

vectorizer.fit_transform takes an iterable of str, unicode, or file objects as its parameter, but you have passed it an iterable of lists (of tokenized strings). You can simply pass the original column of strings, test['tweet'], since CountVectorizer does the tokenizing for you.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Fit on the raw tweets; CountVectorizer tokenizes and lowercases them itself
BOW = vectorizer.fit_transform(test['tweet'])
print(BOW.toarray())
print(vectorizer.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0

This should give you your expected output.
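
Alternatively, if you would rather reuse the column you already tokenized with TweetTokenizer, one option (a sketch, not from the original answer) is to give CountVectorizer an identity analyzer so it skips its own tokenization and counts your existing token lists directly:

from sklearn.feature_extraction.text import CountVectorizer

# A callable analyzer receives each cell as-is; returning it unchanged tells
# CountVectorizer to treat the existing token lists as the final tokens
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0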

Upvotes: 1
