Reputation: 1670
I built a preprocessing step for text analysis. After removing stopwords and stemming like this:
test[col] = test[col].apply(
lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])
train[col] = train[col].apply(
lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])
I've got a column containing lists of "cleaned" words. Here are 3 rows from that column:
['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']
I now want to apply CountVectorizer to this column:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False)  # keep only the 1500 most frequent words
X_train = cv.fit_transform(train[col])
But I get an error:
TypeError: expected string or bytes-like object
It would be a bit strange to join the lists back into strings only to have CountVectorizer split them again.
Upvotes: 9
Views: 12789
Reputation: 193
To apply CountVectorizer to a list of token lists, replace the analyzer with the identity function so the tokens are used as-is:
x=[['ab','cd'], ['ab','de']]
vectorizer = CountVectorizer(analyzer=lambda x: x)
vectorizer.fit_transform(x).toarray()
Out:
array([[1, 1, 0],
[1, 0, 1]], dtype=int64)
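A minimal sketch of the same approach applied to pre-tokenized documents like the ones in the question (assuming scikit-learn is installed; with a callable analyzer, CountVectorizer skips its own preprocessing and tokenization entirely):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document is already a list of tokens, so the identity
# analyzer tells CountVectorizer to count the tokens as-is.
docs = [['size'],
        ['brand', 'new', 'coach', 'bag', 'bought', 'coach', 'outlet']]

cv = CountVectorizer(analyzer=lambda tokens: tokens)
X = cv.fit_transform(docs)  # sparse matrix, one row per document
```

The same `cv` can then be applied to the DataFrame column directly, e.g. `cv.fit_transform(train[col])`, since each cell is already a token list.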
Upvotes: 9
Reputation: 258
Your input should be a list of strings or bytes-like objects; here you seem to be providing a list of lists.
It looks like you have already tokenized your strings into tokens stored in separate lists. What you can do is a hack like this:
inp = [['size'],
       ['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap',
        'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps',
        'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft',
        'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
       ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]
inp = ["<some_space>".join(x) for x in inp]
vectorizer = CountVectorizer(tokenizer=lambda x: x.split("<some_space>"), analyzer="word")
vectorizer.fit_transform(inp)
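As a quick round-trip check of this hack (a sketch with two short documents; the separator is assumed not to occur inside any token):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [['ab', 'cd'], ['ab', 'de']]
# Re-join each token list with a separator, then split on that
# same separator inside the custom tokenizer.
joined = ["<some_space>".join(d) for d in docs]

cv = CountVectorizer(tokenizer=lambda s: s.split("<some_space>"),
                     analyzer="word")
X = cv.fit_transform(joined)
```

Note that scikit-learn may warn that `token_pattern` is unused when a custom `tokenizer` is given; that warning is harmless here.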
Upvotes: 1
Reputation: 1670
As I found no other way to avoid the error, I joined the lists in the columns:
train[col]=train[col].apply(lambda x: " ".join(x) )
test[col]=test[col].apply(lambda x: " ".join(x) )
Only after that did I start getting a result:
X_train = cv.fit_transform(train[col])
X_train=pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names())
Upvotes: 4
Reputation: 348
When you use fit_transform, the parameters passed in have to be an iterable of strings or bytes-like objects. It looks like you should be applying it over your column instead.
X_train = train[col].apply(lambda x: cv.fit_transform(x))
You can read the docs for fit_transform
here.
Upvotes: 0