Reputation: 10970
I have a pandas Series full of text. Using the CountVectorizer
class from the sklearn
package, I have computed the sparse matrix, and I have identified the top words as well. Now I want to filter my sparse matrix down to only those top words.
The original data contains more than 7000
rows and more than 75000
words, so I am creating a sample data set here:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
words = pd.Series(['This is first row of the text column',
                   'This is second row of the text column',
                   'This is third row of the text column',
                   'This is fourth row of the text column',
                   'This is fifth row of the text column'])
count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)
I have created the sparse matrix for all the words in that column. Just to print the sparse matrix here, I am converting it to an array using the .toarray()
method.
print count_vec.get_feature_names()
print sparse_matrix.toarray()
[u'column', u'fifth', u'fourth', u'row', u'second', u'text']
[[1 0 0 1 0 1]
[1 0 0 1 1 1]
[1 0 0 1 0 1]
[1 0 1 1 0 1]
[1 1 0 1 0 1]]
Now I am looking for the frequently appearing words using the following:
# Get frequency count of all features
features_count = sparse_matrix.sum(axis=0).tolist()[0]
features_names = count_vec.get_feature_names()
features = pd.DataFrame(zip(features_names, features_count),
                        columns=['features', 'count']
                        ).sort_values(by=['count'], ascending=False)
features count
0 column 5
3 row 5
5 text 5
1 fifth 1
2 fourth 1
4 second 1
From the above result we know that the frequently appearing words are column
, row
& text
. Now I want to filter my sparse matrix to only these words. I don't want to convert my sparse matrix to an array and then filter, because with my original data I get a memory error, since the number of words is quite high.
The only way I found to get the filtered sparse matrix is to repeat the steps with those specific words, using the vocabulary
parameter, like this:
countvec_subset = CountVectorizer(vocabulary= ['column', 'text', 'row'])
Instead I am looking for a better solution, where I can filter the sparse matrix directly for those words rather than creating it again from scratch.
Upvotes: 3
Views: 2794
Reputation: 76927
You can slice the sparse matrix directly; you just need to derive the columns to slice on: sparse_matrix[:, columns]. (This assumes import numpy as np.)
In [56]: feature_count = sparse_matrix.sum(axis=0)
In [57]: columns = tuple(np.where(feature_count == feature_count.max())[1])
In [58]: columns
Out[58]: (0, 3, 5)
In [59]: sparse_matrix[:, columns].toarray()
Out[59]:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1]], dtype=int64)
In [60]: type(sparse_matrix[:, columns])
Out[60]: scipy.sparse.csr.csr_matrix
In [71]: np.array(features_names)[list(columns)]
Out[71]:
array([u'column', u'row', u'text'],
dtype='<U6')
The sliced subset is still a scipy.sparse.csr.csr_matrix.
Upvotes: 5