tumbleweed

Reputation: 4640

Malformed matrix while using hstack?

I have the following matrices:

>>> X1
shape: (2399, 39999)
type: scipy.sparse.csr.csr_matrix

And

>>> X2
shape: (2399, 333534)
type: scipy.sparse.csr.csr_matrix

And

>>> X3.reshape(-1,1)
shape: (2399, 1)
type: <class 'numpy.ndarray'>

How can I concatenate X1, X2 and X3 side by side (column-wise) to produce a new matrix with shape (2399, 373534)? I know that this can be done with scipy's hstack or vstack. When I tried:

X_combined = sparse.hstack([X1,X2,X3.T])

I got an error instead of the combined matrix:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

So how can I correctly concatenate them into a single matrix?

UPDATE

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df=5)
X1 = count_vect.fit_transform(X)

And

from sklearn.feature_extraction.text import TfidfVectorizer
tdidf_vect = TfidfVectorizer()
X2 = tdidf_vect.fit_transform(X)

And

from hdbscan import HDBSCAN
clusterer = HDBSCAN().fit(X1)
X3 = clusterer.labels_
print(X3.shape)
print(type(X3))

Then:

In:

import scipy as sparse

X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])

Out:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-14baa47e0993> in <module>()
      5 
      6 
----> 7 X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])

/usr/local/lib/python3.5/site-packages/numpy/core/shape_base.py in hstack(tup)
    284     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    285     if arrs[0].ndim == 1:
--> 286         return _nx.concatenate(arrs, 0)
    287     else:
    288         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

Upvotes: 1

Views: 291

Answers (2)

MSeifert

Reputation: 152647

The problem is your import. It should be

from scipy import sparse

The top-level scipy module re-exports the numpy functions (and normally you shouldn't use it directly anyway), so with your import, sparse.hstack is actually numpy's hstack:

>>> import scipy as sparse
>>> sparse.hstack
<function numpy.core.shape_base.hstack>

>>> # incorrect! Correct would be

>>> from scipy import sparse
>>> sparse.hstack
<function scipy.sparse.construct.hstack>

This is all mentioned in the SciPy documentation:

The scipy namespace itself only contains functions imported from numpy. These functions still exist for backwards compatibility, but should be imported from numpy directly.

Everything in the namespaces of scipy submodules is public. In general, it is recommended to import functions from submodule namespaces.
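As a minimal sketch of the corrected import in use: the small matrices below are made-up stand-ins for X1, X2 and X3 (not the asker's data), and the label column is wrapped in csr_matrix only to make every block explicitly sparse and 2-D.

```python
import numpy as np
from scipy import sparse  # the submodule, not `import scipy as sparse`

# Hypothetical stand-ins for X1, X2 (sparse) and X3 (1-D label array).
X1 = sparse.csr_matrix(np.eye(4, 3))
X2 = sparse.csr_matrix(np.ones((4, 5)))
X3 = np.array([0, 1, -1, 1])

# Turn the labels into an explicit sparse column of shape (4, 1).
X3_col = sparse.csr_matrix(X3.reshape(-1, 1))

# All blocks share the same number of rows, so they stack column-wise.
X_combined = sparse.hstack([X1, X2, X3_col])
print(X_combined.shape)  # (4, 9)
```

With this import, sparse.hstack resolves to SciPy's sparse implementation rather than numpy's hstack.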

Upvotes: 2

hpaulj

Reputation: 231385

Why the X3.T? The shape of X3.reshape(-1,1) is compatible with the others, so

sparse.hstack([X1,X2,X3.reshape(-1,1)])

should work.

[(2399, 39999), (2399, 333534), (2399, 1)]
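To illustrate why X3.T does not help, here is a small sketch assuming X3 is a 1-D array like clusterer.labels_ (the 5-element array is a made-up stand-in). Transposing a 1-D array leaves its shape unchanged, while reshape(-1,1) produces a real column.

```python
import numpy as np

# Hypothetical 1-D stand-in for clusterer.labels_
X3 = np.arange(5)

print(X3.shape)                 # (5,)
print(X3.T.shape)               # (5,)   .T is a no-op on 1-D arrays
print(X3.reshape(-1, 1).shape)  # (5, 1) one row per sample, one column
```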

The use of sparse.hstack is correct here, but the same rules about matching dimensions apply whether the inputs are sparse or dense.

In [207]: M
Out[207]: 
<10x3 sparse matrix of type '<class 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>
In [208]: sparse.hstack((M,M))
Out[208]: 
<10x6 sparse matrix of type '<class 'numpy.int32'>'
    with 18 stored elements in COOrdinate format>

sparse.hstack will convert a dense array such as A to sparse before doing its version of concatenate.

In [209]: A=np.ones((10,1),int)
In [210]: sparse.hstack((M,M,A))
Out[210]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>

or you could convert it to sparse first.

In [211]: As=sparse.csr_matrix(A)
In [212]: As
Out[212]: 
<10x1 sparse matrix of type '<class 'numpy.int32'>'
    with 10 stored elements in Compressed Sparse Row format>
In [213]: sparse.hstack((M,M,As))
Out[213]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>

Starting with a 1d A:

In [214]: A=np.ones((10),int)
In [215]: sparse.hstack([M,M,A.reshape(-1,1)])
Out[215]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>
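The results above come back in COOrdinate format. If CSR is wanted for later row slicing or for passing to an estimator, a sketch along these lines should work, using the format argument of sparse.hstack. The M and A here are small made-up stand-ins, not the exact arrays from the session above.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: a 10x3 sparse matrix and a 1-D dense array.
M = sparse.csr_matrix(np.eye(10, 3, dtype=int))
A = np.ones(10, dtype=int)

# Make the 1-D array a sparse column so all blocks are 2-D and sparse.
As = sparse.csr_matrix(A.reshape(-1, 1))

# Ask hstack for CSR output instead of the default COO.
combined = sparse.hstack((M, M, As), format="csr")
print(combined.format)  # csr
print(combined.shape)   # (10, 7)
```

Calling .tocsr() on the COO result is an equivalent alternative.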

Upvotes: 2
