I have the following matrices:
>>> X1
shape: (2399, 39999)
type: scipy.sparse.csr.csr_matrix
And
>>> X2
shape: (2399, 333534)
type: scipy.sparse.csr.csr_matrix
And
>>> X3.reshape(-1,1)
shape: (2399, 1)
type: <class 'numpy.ndarray'>
How can I concatenate X1, X2 and X3 column-wise (side by side) to produce a new matrix with shape (2399, 373534)? I know that this can be done with scipy's hstack or vstack. However, when I tried:
X_combined = sparse.hstack([X1,X2,X3.T])
instead of a combined matrix I got this error:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
So how can I correctly concatenate them into a single matrix?
UPDATE
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df=5)
X1 = count_vect.fit_transform(X)
And
from sklearn.feature_extraction.text import TfidfVectorizer
tdidf_vect = TfidfVectorizer()
X2 = tdidf_vect.fit_transform(X)
And
from hdbscan import HDBSCAN
clusterer = HDBSCAN().fit(X1)
X3 = clusterer.labels_
print(X3.shape)
print(type(X3))
Then:
In:
import scipy as sparse
X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])
Out:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-14baa47e0993> in <module>()
5
6
----> 7 X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])
/usr/local/lib/python3.5/site-packages/numpy/core/shape_base.py in hstack(tup)
284 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
285 if arrs[0].ndim == 1:
--> 286 return _nx.concatenate(arrs, 0)
287 else:
288 return _nx.concatenate(arrs, 1)
ValueError: all the input arrays must have same number of dimensions
Upvotes: 1
Views: 291
The problem is your import; it should be:
from scipy import sparse
The top-level scipy module (which you normally shouldn't use directly anyway) imports the numpy functions, so with your version sparse.hstack is actually numpy's hstack:
>>> import scipy as sparse
>>> sparse.hstack
<function numpy.core.shape_base.hstack>
>>> # incorrect! Correct would be
>>> from scipy import sparse
>>> sparse.hstack
<function scipy.sparse.construct.hstack>
This is all mentioned in their documentation:
The scipy namespace itself only contains functions imported from numpy. These functions still exist for backwards compatibility, but should be imported from numpy directly.
Everything in the namespaces of scipy submodules is public. In general, it is recommended to import functions from submodule namespaces.
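As a minimal sketch of the corrected import, the snippet below stacks small stand-ins for X1, X2 and X3. The sizes and contents are made up for illustration, and the label column is converted to sparse explicitly just to keep the example unambiguous.

```python
import numpy as np
from scipy import sparse  # the sparse submodule, not `import scipy as sparse`

# Hypothetical small stand-ins for the real matrices (sizes are made up).
X1 = sparse.csr_matrix(np.eye(5, 4))   # sparse, shape (5, 4)
X2 = sparse.csr_matrix(np.eye(5, 7))   # sparse, shape (5, 7)
X3 = np.arange(5)                      # 1-D labels, shape (5,)

# Turn the labels into a (5, 1) column so the row counts match.
X3_col = sparse.csr_matrix(X3.reshape(-1, 1))

X_combined = sparse.hstack([X1, X2, X3_col])

print(sparse.hstack.__module__)  # a scipy.sparse module, not a numpy one
print(X_combined.shape)          # (5, 12)
```

Printing `sparse.hstack.__module__` is a quick way to confirm which hstack the name actually points to.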
Upvotes: 2
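Why the X3.T? Transposing gives a single row of 2399 values rather than a column, so its row count no longer matches X1 and X2. The shape of X3.reshape(-1,1) is compatible with the others, so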
sparse.hstack([X1,X2,X3.reshape(-1,1)])
should work.
[(2399, 39999), (2399, 333534), (2399, 1)]
Using sparse.hstack is correct here, but the same rules about matching dimensions apply, whether sparse or dense.
In [207]: M
Out[207]:
<10x3 sparse matrix of type '<class 'numpy.int32'>'
with 9 stored elements in Compressed Sparse Row format>
In [208]: sparse.hstack((M,M))
Out[208]:
<10x6 sparse matrix of type '<class 'numpy.int32'>'
with 18 stored elements in COOrdinate format>
sparse.hstack will convert a dense array A to sparse before doing its version of concatenate.
In [209]: A=np.ones((10,1),int)
In [210]: sparse.hstack((M,M,A))
Out[210]:
<10x7 sparse matrix of type '<class 'numpy.int32'>'
with 28 stored elements in COOrdinate format>
Or you could convert it to sparse first:
In [211]: As=sparse.csr_matrix(A)
In [212]: As
Out[212]:
<10x1 sparse matrix of type '<class 'numpy.int32'>'
with 10 stored elements in Compressed Sparse Row format>
In [213]: sparse.hstack((M,M,As))
Out[213]:
<10x7 sparse matrix of type '<class 'numpy.int32'>'
with 28 stored elements in COOrdinate format>
Starting with a 1d A:
In [214]: A=np.ones((10),int)
In [215]: sparse.hstack([M,M,A.reshape(-1,1)])
Out[215]:
<10x7 sparse matrix of type '<class 'numpy.int32'>'
with 28 stored elements in COOrdinate format>
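The dimension rule can be checked with a small sketch like the one below. The matrix M and the labels array here are made-up stand-ins, not the asker's data. A column-shaped block stacks fine, while a row-shaped block with the same values raises ValueError. Passing format="csr" is optional and only asks hstack for a CSR result instead of the default COO.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: a 10x3 sparse matrix and a 1-D array of 10 labels.
M = sparse.csr_matrix(np.eye(10, 3, dtype=int))
labels = np.arange(10)

col = sparse.csr_matrix(labels.reshape(-1, 1))  # shape (10, 1): rows match M
row = sparse.csr_matrix(labels.reshape(1, -1))  # shape (1, 10): rows do not match

# Row counts agree (10, 10, 10), so horizontal stacking works.
stacked = sparse.hstack([M, M, col], format="csr")
print(stacked.shape)  # (10, 7)

# The last column of the result holds the labels.
last_col = stacked[:, -1].toarray().ravel()

# Row counts disagree (10, 10, 1), so hstack refuses.
try:
    sparse.hstack([M, M, row])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```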
Upvotes: 2