scipy.sparse.csr_matrix row filtering - how to properly achieve it?

Question

I am working with scipy.sparse.csr_matrix objects. The one I have at hand comes from scikit-learn's TfidfVectorizer:

vectorizer = TfidfVectorizer(min_df=0.0005)
textsMet2 = vectorizer.fit_transform(textsMet)

Ok, so this is a matrix:

textsMet2
<999x1632 sparse matrix of type ''
    with 5042 stored elements in Compressed Sparse Row format>

Now I want to get only those rows which have any non-zero elements. So obviously I go for simple indexing:

 textsMet2[(textsMet2.sum(axis=1)>0),:]

And I get an error:

File "D:\Apps\Python\lib\site-packages\scipy\sparse\sputils.py", line 327, in _boolean_index_to_array
    raise IndexError('invalid index shape')
IndexError: invalid index shape

If I remove the last part of the index, I get something strange:

textsMet2[(textsMet2.sum(axis=1)>0)]
<1x492 sparse matrix of type ''
with 1 stored elements in Compressed Sparse Row format>

Why does it show me just a 1-row matrix?

Once again, I want to get all of the rows of this matrix that have any non-zero element. Does anyone know how to do this?

mbatchkarov · Accepted Answer

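You need to ravel your mask. Calling sum(axis=1) on a csr_matrix returns a 2-D numpy.matrix of shape (n_rows, 1), so comparing it gives a 2-D boolean mask, while row indexing expects a 1-D mask with one entry per row. Here is a bit of code from the thing I'm working on at the moment: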

    tr_matrix = pipeline.fit_transform(train_text, y_train, **fit_params)

    # remove documents with too few features
    to_keep_train = tr_matrix.sum(axis=1) >= config['min_train_features']
    to_keep_train = np.ravel(np.array(to_keep_train))
    logging.info('%d/%d train documents have enough features', 
                 sum(to_keep_train), len(y_train))
    tr_matrix = tr_matrix[to_keep_train, :]

This is a little inelegant but gets the job done.
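For anyone who wants to try the idea in isolation, here is a minimal self-contained sketch. The toy matrix and the names dense, X, mask, and filtered are made up for illustration and stand in for the TF-IDF output from the question; the only technique shown is flattening the row sums into a 1-D boolean mask before indexing.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 4x3 matrix standing in for the TF-IDF output (hypothetical data);
# the second and fourth rows contain no non-zero entries.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 0.0, 0.0]])
X = csr_matrix(dense)

# For csr_matrix, sum(axis=1) gives a 2-D result of shape (4, 1).
row_sums = X.sum(axis=1)

# Flatten to a plain 1-D array, then compare to get one boolean per row.
mask = np.asarray(row_sums).ravel() > 0

# A 1-D boolean mask of length n_rows selects whole rows.
filtered = X[mask, :]
print(filtered.shape)
```

Note that testing the row sum against zero only matches "has any non-zero element" when all values are non-negative, which should hold for TF-IDF weights. If the matrix could hold negative values, a mask such as X.getnnz(axis=1) > 0 may be closer to the intent; it is already 1-D, though it counts stored entries, including any explicitly stored zeros.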
