Reputation: 1368

numpy.ndarray sparse matrix to dense

I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it doesn't.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

I tried wrapping it with a SciPy csr_matrix but that gives errors as well.

Is there any way to make random forest accept this data? (not sure that dense would actually fit in memory, but that's another thing...)

EDIT 1

The code generating the error is just this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy')  # this returns an ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)

As for the suggestion to use toarray(), ndarray does not have such a method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the array. Is there an option to run RandomForestClassifier with a sparse array?

EDIT 2

It seems that the data should have been saved using SciPy's sparse save/load functions, as mentioned in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more than the matrix object itself would have had to be saved (its underlying data, indices, indptr and shape).
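
For reference, a minimal sketch of the round trip with scipy.sparse.save_npz and scipy.sparse.load_npz. The small random matrix and the file name train.npz are stand-ins made up for illustration, not the real data:

```python
import os
import tempfile

from scipy import sparse

# Small random CSR matrix standing in for the real training data (assumption).
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")  # hypothetical file name
    sparse.save_npz(path, X)               # writes the sparse structure, not a pickled object
    X_loaded = sparse.load_npz(path)       # comes back as a sparse matrix, not an ndarray

print(type(X_loaded).__name__, X_loaded.shape, X_loaded.nnz)
```

Loaded this way, the result still has toarray() and the other sparse methods, so no unwrapping step is needed.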

Upvotes: 3

Answers (4)

KivLaughLove

Reputation: 11

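Since you've loaded a CSR matrix using np.load, you need to convert it from a NumPy array back to a CSR matrix. You said you tried wrapping it with csr_matrix, but the array itself is not the matrix contents; you need to call .all() on it first to get the wrapped object out:

from scipy.sparse import csr_matrix

temp = csr_matrix(X_train.all())  # unwrap the object array, then rebuild the CSR matrix
X_train = temp.toarray()          # dense copy; only feasible if it fits in memory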
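
A minimal sketch of the same unwrapping step using ndarray.item(), which is the more conventional way to pull the single object out of a 0-d object array. The tiny matrix M and the variable names are made up for illustration, and the wrapper is built by hand here rather than loaded from a file:

```python
import numpy as np
from scipy import sparse

# Tiny CSR matrix standing in for the real data (assumption).
M = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Build a 0-d object array holding M, like the X_train shown in the question.
holder = np.empty(1, dtype=object)
holder[0] = M
wrapped = holder.reshape(())

unwrapped = wrapped.item()  # returns the stored sparse matrix itself
print(wrapped.shape, wrapped.dtype, type(unwrapped).__name__)
```

Note that recent NumPy versions refuse to load object arrays unless np.load is called with allow_pickle=True.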

Upvotes: 1

mibm

Reputation: 1368

RandomForestClassifier can run on data in this sparse format. The code has been running for an hour and a half now, so hopefully it will actually finish :-)
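
A minimal sketch of fitting directly on a CSR matrix. The synthetic data, the random labels and n_estimators=10 are assumptions chosen to keep the example small; they say nothing about how long the real data takes:

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Synthetic sparse features and random binary labels (assumption).
rng = np.random.default_rng(0)
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=200)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)          # the sparse matrix is passed as-is, no toarray() needed
pred = model.predict(X)
print(pred.shape)
```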

Upvotes: 0

hpaulj

Reputation: 231665

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

means that your code, or something it calls, has done np.array(M), where M is a CSR sparse matrix. That just wraps the matrix in an object-dtype array.

To use a sparse matrix in code that doesn't take sparse matrices, you have to first convert them to dense:

 arr = M.toarray()    # or M.A same thing
 mat = M.todense()    # to make a np.matrix

But given the dimensions and number of nonzero elements, it is likely that this conversion will produce a memory error.
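
A rough back-of-the-envelope check, using the shape and stored-element count from the repr above. The 4-byte index size is an assumption (32-bit indices), and the CSR figure ignores Python object overhead:

```python
rows, cols, nnz = 1_443_899, 1_936_774, 141_256_894
value_bytes = 8   # float64
index_bytes = 4   # assuming 32-bit indices

# Dense: one float64 per cell.
dense_bytes = rows * cols * value_bytes

# CSR: nnz values + nnz column indices + (rows + 1) row pointers.
csr_bytes = nnz * value_bytes + nnz * index_bytes + (rows + 1) * index_bytes

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array needs about 22 TB, against under 2 GB for the CSR form.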

Upvotes: 8

Nathan

Reputation: 10336

I believe you're looking for the toarray method, as shown in the documentation.

So you can do, e.g., X_dense = X_train.toarray().

Of course, then your computer crashes (unless you have the requisite 22 terabytes of RAM?).

Upvotes: 1
