Yury Wallet

Reputation: 1670

Apply CountVectorizer to column with list of words in rows in Python

I wrote a preprocessing step for text analysis that removes stopwords and stems each word, like this:

test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

I now have a column containing lists of "cleaned words". Here are three rows from that column:

['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']

I now want to apply CountVectorizer to this column:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False) # will leave only 1500 words
X_train = cv.fit_transform(train[col])

But I got an error:

TypeError: expected string or bytes-like object

It seems a bit strange to join each list back into a string only for CountVectorizer to split it again.
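
For reference, the error can be reproduced without the DataFrame. This is a minimal sketch, assuming a recent scikit-learn; the two-row `docs` list is made up for illustration. With `lowercase=False`, the default word analyzer hands each list straight to a regular-expression tokenizer, which only accepts strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample: each "document" is already a list of tokens.
docs = [['size'], ['brand', 'new', 'coach', 'bag']]

cv = CountVectorizer(analyzer='word', lowercase=False)

try:
    cv.fit_transform(docs)
    error = None
except TypeError as exc:
    # The default tokenizer runs a regex over each document and rejects lists.
    error = exc

print(type(error).__name__, error)
```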

Upvotes: 9

Views: 12789

Answers (4)

Aleksandr Gavrilov

Reputation: 193

To apply CountVectorizer to lists of words, bypass the built-in analyzer by passing a callable that returns each list unchanged.

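from sklearn.feature_extraction.text import CountVectorizer

x = [['ab', 'cd'], ['ab', 'de']]
# Each document is already a list of tokens, so the analyzer returns it as-is.
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
vectorizer.fit_transform(x).toarray()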

Out:
array([[1, 1, 0],
       [1, 0, 1]], dtype=int64)
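
The same idea carries over to a DataFrame column. Below is a sketch, assuming pandas is installed; the small `train` and `test` frames, the column name, and the `identity` helper are stand-ins for the asker's data, not part of the original code. The vocabulary is learned from the training column only and then reused for the test column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

col = 'item_description'  # hypothetical column name
train = pd.DataFrame({col: [
    ['size'],
    ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet'],
]})
test = pd.DataFrame({col: [['new', 'coach', 'wallet']]})


def identity(tokens):
    """Return the token list unchanged."""
    return tokens


cv = CountVectorizer(max_features=1500, analyzer=identity)
X_train = cv.fit_transform(train[col])  # learn the vocabulary and count tokens
X_test = cv.transform(test[col])        # reuse the vocabulary; unseen words are ignored

print(sorted(cv.vocabulary_))
print(X_train.toarray())
print(X_test.toarray())
```

A named function is used instead of a lambda here so that the fitted vectorizer can be pickled with the standard library if needed.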

Upvotes: 9

Kerem T

Reputation: 258

Your input should be a list of strings or bytes; in this case you seem to be passing a list of lists.

It looks like you have already tokenized each string into a separate list of tokens. One workaround is the hack below:

from sklearn.feature_extraction.text import CountVectorizer

inp = [['size'],
       ['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap',
        'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps',
        'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft',
        'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
       ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]

# Join each token list with a separator that cannot occur inside a token.
inp = ["<some_space>".join(x) for x in inp]

# Split on the same separator so the original tokens are recovered.
vectorizer = CountVectorizer(tokenizer=lambda x: x.split("<some_space>"), analyzer="word")

vectorizer.fit_transform(inp)
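
A variant that avoids the join and split round trip, offered as a sketch rather than as part of the answer above: keep `analyzer="word"` but replace both the preprocessor and the tokenizer with pass-through functions. Unlike a callable analyzer, this keeps the built-in n-gram step, so `ngram_range` still works on pre-tokenized lists. The `docs` sample and the `passthrough` helper are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents.
docs = [['brand', 'new', 'coach'], ['new', 'coach', 'bag']]


def passthrough(tokens):
    """Return the token list unchanged."""
    return tokens


vectorizer = CountVectorizer(
    analyzer="word",
    preprocessor=passthrough,  # skip lowercasing and accent stripping
    tokenizer=passthrough,     # documents are already token lists
    token_pattern=None,        # unused when a tokenizer is supplied
    ngram_range=(1, 2),        # unigrams and bigrams
)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
```

Bigrams are formed by joining adjacent tokens with a single space, so the vocabulary contains entries such as `'new coach'` alongside the single words.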

Upvotes: 1

Yury Wallet

Reputation: 1670

Since I found no other way to avoid the error, I joined the lists in the column back into strings:

train[col] = train[col].apply(lambda x: " ".join(x))
test[col] = test[col].apply(lambda x: " ".join(x))

Only after that did I get a result:

X_train = cv.fit_transform(train[col])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out())
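
One thing to watch with the join approach: once the lists become strings, CountVectorizer tokenizes them again with its default `token_pattern`, which keeps only tokens of two or more word characters. The sketch below compares the two routes on a shortened row from the question; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened sample rows; note the single-character token 'x'.
rows = [['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers'], ['size']]

# Route 1: join each list into a string and let the default tokenizer split it.
cv_joined = CountVectorizer(analyzer='word', lowercase=False)
cv_joined.fit([" ".join(tokens) for tokens in rows])

# Route 2: pass the lists through unchanged with a callable analyzer.
cv_lists = CountVectorizer(analyzer=lambda tokens: tokens)
cv_lists.fit(rows)

print('x' in cv_joined.vocabulary_)  # False: the default pattern drops it
print('x' in cv_lists.vocabulary_)   # True: the token list is kept as-is
```

If single-character tokens matter, either use a callable analyzer or pass a custom `token_pattern` when vectorizing the joined strings.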

Upvotes: 4

Justin

Reputation: 348

When you call fit_transform, the argument must be an iterable of strings or bytes-like objects. It looks like you should apply it over your column instead:

# Caution: this re-fits the vectorizer on every row, so each row gets its own vocabulary.
X_train = train[col].apply(lambda x: cv.fit_transform(x))

See the scikit-learn documentation for CountVectorizer.fit_transform for details.

Upvotes: 0
