Saurabh Sood

Reputation: 233

Insert result of sklearn CountVectorizer in a pandas dataframe

I have 14784 text documents that I am trying to vectorize so that I can run some analysis. I used sklearn's CountVectorizer to convert the documents to feature vectors by calling:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)

where examples is an array of all the text documents.

Now I am trying to use additional features, which I am storing in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature matrix is (14784, 21343).

What would be a good way to insert the vectorized features into the pandas dataframe?

Upvotes: 19

Views: 15682

Answers (2)

Nickil Maveli

Reputation: 29711

Learn the vocabulary dictionary from the raw documents and return the document-term matrix.

X = vect.fit_transform(docs) 

Convert the sparse CSR matrix to dense format and use the feature names (the array mapping feature integer indices to feature names) as the column labels.

count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())

Concatenate the original df and count_vect_df column-wise. pd.concat aligns rows on the index, so both frames need the same index for the rows to line up.

pd.concat([df, count_vect_df], axis=1)

Upvotes: 39

Tchotchke

Reputation: 3121

If your base data frame is df, all you need to do is:

import pandas as pd
# features is a scipy sparse matrix, so convert it to a dense array first
features_df = pd.DataFrame(features.toarray())
combined_df = pd.concat([df, features_df], axis=1)

I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features when constructing the vectorizer, like vectorizer = CountVectorizer(max_features=1000), to limit the number of features. Note that max_features is a constructor argument, not an argument to fit_transform.

Upvotes: -1
