Reputation: 4792
I was working with some scipy.sparse.csr_matrix objects. Specifically, the one I have at hand comes from Scikit-learn's TfidfVectorizer:
vectorizer = TfidfVectorizer(min_df=0.0005)
textsMet2 = vectorizer.fit_transform(textsMet)
Ok, so this is a matrix:
textsMet2
<999x1632 sparse matrix of type '<class 'numpy.float64'>'
with 5042 stored elements in Compressed Sparse Row format>
Now I want to get only those rows which have any non-zero elements. So obviously I go for simple indexing:
textsMet2[(textsMet2.sum(axis=1)>0),:]
And get an error:
File "D:\Apps\Python\lib\site-packages\scipy\sparse\sputils.py", line 327, in _boolean_index_to_array
    raise IndexError('invalid index shape')
IndexError: invalid index shape
If I remove the last part of the indexing, I get something strange:
textsMet2[(textsMet2.sum(axis=1)>0)]
<1x492 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
Why does it give me a matrix with just one row?
Once again, I want to get all of the rows of this matrix that have at least one non-zero element. Does anyone know how to do this?
Upvotes: 1
Views: 2478
Reputation: 641
This will remove all-zero rows and columns (those whose sum is 0):
X = X[np.array(np.sum(X,axis=1)).ravel() != 0,:]
X = X[:,np.array(np.sum(X,axis=0)).ravel() != 0]
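One caveat with a sum-based mask: a row whose entries cancel out (for example 1 and -1) sums to zero even though it is not empty. That cannot happen with non-negative TF-IDF weights, but for general matrices a possible alternative is to count stored entries with `getnnz` instead. This is a swapped-in technique, not the method above, and the small matrix here is made up purely for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy matrix: row 0 sums to zero but is NOT empty,
# row 1 and columns 1 and 3 are genuinely all-zero.
X = csr_matrix(np.array([
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 0.0,  0.0, 0.0],
    [0.0, 0.0,  2.0, 0.0],
]))

# getnnz counts *stored* entries, so drop explicitly stored zeros first.
X.eliminate_zeros()

# getnnz(axis=1) is already a 1-D array, so no ravel is needed.
row_mask = X.getnnz(axis=1) > 0
X = X[row_mask, :]

col_mask = X.getnnz(axis=0) > 0
X = X[:, col_mask]

print(X.toarray())
```

Row 0 survives here because it has stored entries, whereas a `sum != 0` test would have dropped it.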
Upvotes: 0
Reputation: 16049
You need to ravel your mask, i.e. flatten it to a one-dimensional array. Here is a bit of code from the thing I'm working on at the moment:
tr_matrix = pipeline.fit_transform(train_text, y_train, **fit_params)
# remove documents with too few features
to_keep_train = tr_matrix.sum(axis=1) >= config['min_train_features']
to_keep_train = np.ravel(np.array(to_keep_train))
logging.info('%d/%d train documents have enough features', sum(to_keep_train), len(y_train))
tr_matrix = tr_matrix[to_keep_train, :]
This is a little inelegant but gets the job done.
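For a self-contained illustration of why the ravel matters, here is a minimal sketch on a made-up 4x3 matrix (the data is hypothetical, standing in for the TF-IDF output). With `csr_matrix`, `sum(axis=1)` gives back a two-dimensional column of shape (n_rows, 1), so the comparison yields a 2-D boolean mask; flattening it to one dimension makes it usable as a row selector:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the TF-IDF matrix: rows 1 and 3 are all zeros.
X = csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],
]))

row_sums = X.sum(axis=1)                 # 2-D column, shape (4, 1)
mask = np.asarray(row_sums).ravel() > 0  # 1-D boolean mask, shape (4,)

X_nonempty = X[mask, :]                  # keeps rows 0 and 2
print(row_sums.shape, mask.shape, X_nonempty.shape)
```

The same mask can be reused to filter any per-document labels so they stay aligned with the remaining rows.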
Upvotes: 1