Kathirmani Sukumar

Reputation: 10970

Filter only certain words from sklearn CountVectorizer sparse matrix

I have a pandas Series full of text. Using CountVectorizer from the sklearn package, I have computed the sparse count matrix, and I have also identified the top words. Now I want to filter my sparse matrix down to only those top words.

The original data contains more than 7000 rows and more than 75000 words, so I am creating sample data here:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
words = pd.Series(['This is first row of the text column',
                   'This is second row of the text column',
                   'This is third row of the text column',
                   'This is fourth row of the text column',
                   'This is fifth row of the text column'])
count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)

I have created the sparse matrix for all the words in that column. Here, just to print the sparse matrix, I am converting it to an array using the .toarray() method.

print count_vec.get_feature_names()
print sparse_matrix.toarray()
[u'column', u'fifth', u'fourth', u'row', u'second', u'text']
[[1 0 0 1 0 1]
 [1 0 0 1 1 1]
 [1 0 0 1 0 1]
 [1 0 1 1 0 1]
 [1 1 0 1 0 1]]

Now I look for frequently appearing words using the following:

# Get frequency count of all features
features_count = sparse_matrix.sum(axis=0).tolist()[0]
features_names = count_vec.get_feature_names()
features = pd.DataFrame(zip(features_names, features_count),
                        columns=['features', 'count']
                        ).sort_values(by=['count'], ascending=False)

  features  count
0   column      5
3      row      5
5     text      5
1    fifth      1
2   fourth      1
4   second      1

From the above result we know that the frequently appearing words are column, row and text. Now I want to filter my sparse matrix for only these words. I don't want to convert the sparse matrix to an array and then filter, because that raises a memory error on my original data, where the number of words is quite high.

The only way I have found to get such a sparse matrix is to repeat the steps with those specific words, using the vocabulary parameter, like this:

countvec_subset = CountVectorizer(vocabulary= ['column', 'text', 'row'])

Instead, I am looking for a better solution where I can filter the existing sparse matrix directly for those words, rather than creating it again from scratch.

Upvotes: 3

Views: 2794

Answers (1)

Zero

Reputation: 76927

You can slice the sparse matrix directly. Derive the column indices you want to keep, then index with sparse_matrix[:, columns]:

In [56]: feature_count = sparse_matrix.sum(axis=0)

In [57]: columns = tuple(np.where(feature_count == feature_count.max())[1])

In [58]: columns
Out[58]: (0, 3, 5)

In [59]: sparse_matrix[:, columns].toarray()
Out[59]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int64)

In [60]: type(sparse_matrix[:, columns])
Out[60]: scipy.sparse.csr.csr_matrix

In [71]: np.array(features_names)[list(columns)]
Out[71]:
array([u'column', u'row', u'text'],
      dtype='<U6')

The sliced subset is still a scipy.sparse.csr.csr_matrix; slicing does not densify it.
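If the words to keep are already known by name, the column indices can also be looked up instead of derived from the counts. The following is a minimal Python 3 sketch, not part of the original answer: it assumes the fitted vectorizer's vocabulary_ mapping (term to column index) and reuses the sample rows from the question. The names docs, top_words, subset and k are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Sample rows copied from the question.
docs = [
    'This is first row of the text column',
    'This is second row of the text column',
    'This is third row of the text column',
    'This is fourth row of the text column',
    'This is fifth row of the text column',
]

count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the fitted matrix.
top_words = ['column', 'row', 'text']  # illustrative choice of words to keep
columns = [count_vec.vocabulary_[word] for word in top_words]

# Column slicing returns another sparse matrix; no dense copy is made.
subset = sparse_matrix[:, columns]

# Alternative (assumption: you want the k most frequent terms rather than
# only the terms tied for the maximum count). The per-column totals form a
# single dense row of length n_features, which is small compared with the
# full matrix.
k = 3
totals = np.asarray(sparse_matrix.sum(axis=0)).ravel()
top_k_columns = np.argsort(-totals, kind='stable')[:k]
top_k_subset = sparse_matrix[:, top_k_columns]
```

The columns of subset follow the order of top_words, so keep that list around if you need to map columns back to terms later.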

Upvotes: 5
