Reputation: 101
Here is my dataframe test after I cleaned and tokenized it:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)
print(test)
Output:
0 congratulations dear friend ... [congratulations, dear, friend]
1 happy anniversary be happy ... [happy, anniversary, be, happy]
2 make some sandwich ... [make, some, sandwich]
I would like to create a bag of words for my data. The following gave me the error 'list' object has no attribute 'lower':
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names())
The second attempt gave AttributeError: 'list' object has no attribute 'split':
from collections import Counter
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))
print(test['BOW'])
Can you please assist me with either method, or both? Thanks!
Upvotes: 0
Views: 1010
Reputation: 2007
As your output example shows, test['tokenize'] contains lists in its cells. Those lists were already produced by splitting the original strings, so there is no string left to split. To get this line
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))
working, change it to
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x))
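For reference, here is a minimal end-to-end sketch of this fix, with a small test dataframe rebuilt from the three tweets shown in the question:

import pandas as pd
from collections import Counter
from nltk.tokenize import TweetTokenizer

test = pd.DataFrame({'tweet': ['congratulations dear friend',
                               'happy anniversary be happy',
                               'make some sandwich']})
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)

# Counter accepts any iterable, so the token lists can be counted directly;
# no .split(" ") is needed because the column holds lists, not strings.
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x))
print(test['BOW'])
# 0    Counter({'congratulations': 1, 'dear': 1, 'friend': 1})
# 1    Counter({'happy': 2, 'anniversary': 1, 'be': 1})
# 2    Counter({'make': 1, 'some': 1, 'sandwich': 1})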
Upvotes: 1
Reputation: 700
vectorizer.fit_transform
takes an iterable of str, unicode, or file objects as its parameter, but you have passed it an iterable of lists (of tokenized strings). You can simply pass the original series of strings, test['tweet'], since CountVectorizer does the tokenizing for you.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
BOW = vectorizer.fit_transform(test['tweet'])
print(BOW.toarray())
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0; removed in 1.2
This should give you your expected output.
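If you would rather keep the TweetTokenizer output, one option (a sketch, not part of the original answer) is to give CountVectorizer a callable analyzer: when analyzer is a callable, the vectorizer skips its own preprocessing and tokenization and uses whatever the callable returns as the features, so the pre-tokenized lists pass through unchanged.

from sklearn.feature_extraction.text import CountVectorizer

# The callable receives each cell of test['tokenize'] (a list of tokens)
# and returns it as-is, bypassing CountVectorizer's built-in tokenizer.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names_out())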
Upvotes: 1