Reputation: 1368
I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.
I expected the object to have a todense method, but it doesn't.
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
with 141256894 stored elements in Compressed Sparse Row format>,
dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>
I tried wrapping it with a SciPy csr_matrix, but that gives errors as well.
Is there any way to make random forest accept this data? (not sure that dense would actually fit in memory, but that's another thing...)
EDIT 1
The code generating the error is just this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy') # this returns a ndarray
train_gt = pd.read_csv('train_gt.csv')
model = RandomForestClassifier()
model.fit(X_train, train_gt.target)
As for the suggestion to use toarray(), ndarray does not have such a method.
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the array. Is there an option to run RandomForestClassifier with a sparse array?
EDIT 2
It seems that the data should have been saved using SciPy's sparse save/load, as mentioned in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more of the matrix's data would have had to be saved explicitly.
Upvotes: 3
Views: 16706
Reputation: 11
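Since you've loaded a csr matrix using np.load, you need to convert it from a NumPy array back to a csr matrix. You said you tried wrapping it with csr_matrix, but that wraps the array rather than its contents; you need to call .all() to get at the contents:
from scipy.sparse import csr_matrix
temp = csr_matrix(X_train.all())  # .all() is this answer's way of pulling the stored matrix out of the array
X_train = temp.toarray()  # dense copy; needs enough memory for the full array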
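As a hedged alternative to the .all() trick, the stored matrix can also be retrieved with ndarray.item(). The sketch below reproduces the situation from the question on a tiny stand-in matrix; the file name and the matrix values are illustrative, and allow_pickle=True is an assumption about the NumPy version in use (recent releases refuse to load object arrays without it).

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, issparse

# Tiny stand-in for the real matrix (illustrative only).
M = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Saving a sparse matrix with np.save pickles it inside a 0-d object array.
path = os.path.join(tempfile.mkdtemp(), "train.npy")
np.save(path, M)

# Assumption: a recent NumPy, which needs allow_pickle=True for object arrays.
X_train = np.load(path, allow_pickle=True)

# .item() returns the single Python object held by the 0-d array.
X_sparse = X_train.item()
```

X_sparse is the original CSR matrix again, so no dense copy is needed just to unwrap it.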
Upvotes: 1
Reputation: 1368
It seems that the data should have been saved using SciPy's sparse save/load, as mentioned in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more of the matrix's data would have had to be saved explicitly.
RandomForestClassifier can run using data in this format.
The code has been running for 1:30h now, so hopefully it will actually finish :-)
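For reference, a minimal sketch of that approach on a small random stand-in matrix. It assumes a SciPy version that provides scipy.sparse.save_npz and load_npz; the shapes, labels, file name, and n_estimators value are made up for illustration and are not the original data or settings.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small random stand-in for the real 1443899 x 1936774 matrix (illustrative only).
rng = np.random.default_rng(0)
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=200)

# Save and reload with SciPy's own sparse format instead of np.save / np.load.
path = os.path.join(tempfile.mkdtemp(), "train.npz")
sparse.save_npz(path, X)
X_loaded = sparse.load_npz(path)

# Fit on the sparse matrix directly; the script never calls toarray().
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_loaded, y)
predictions = model.predict(X_loaded)
```

The loaded object is a real sparse matrix rather than an ndarray wrapper, which is why fit accepts it.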
Upvotes: 0
Reputation: 231335
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
with 141256894 stored elements in Compressed Sparse Row format>,
dtype=object)
means that your code, or something it calls, has done np.array(M) where M is a csr sparse matrix. It just wraps that matrix in an object dtype array.
To use a sparse matrix in code that doesn't take sparse matrices, you have to first convert them to dense:
arr = M.toarray() # or M.A same thing
mat = M.todense() # to make a np.matrix
But given the dimensions and number of nonzero elements, it is likely that this conversion will produce a memory error.
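A small sketch of both points on a tiny matrix, where the dense conversion is harmless. The matrix values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in matrix (illustrative only).
M = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# np.array does not densify; it stores the matrix object in a 0-d array.
wrapped = np.array(M)
print(wrapped.dtype, wrapped.shape)

# Explicit dense conversions, safe here only because the matrix is tiny.
arr = M.toarray()   # plain numpy.ndarray
mat = M.todense()   # numpy.matrix
```

Checking dtype and shape like this is a quick way to tell the object wrapper apart from a real numeric array.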
Upvotes: 8
Reputation: 10306
I believe you're looking for the toarray method, as shown in the documentation.
So you can do, e.g., X_dense = X_train.toarray().
Of course, then your computer crashes (unless you have the requisite 22 terabytes of RAM?).
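The memory figure can be checked with a rough back-of-the-envelope calculation from the shape and nonzero count printed in the question. The CSR estimate assumes 32-bit index arrays, which is an assumption about how the matrix is stored rather than something the question states.

```python
# Shape and nonzero count taken from the repr in the question.
n_rows, n_cols, nnz = 1443899, 1936774, 141256894

bytes_per_float64 = 8
bytes_per_index = 4  # assumption: 32-bit indices and indptr

# Dense storage holds every cell, zero or not.
dense_bytes = n_rows * n_cols * bytes_per_float64

# CSR holds one value and one column index per nonzero, plus one row pointer per row (+1).
csr_bytes = (
    nnz * bytes_per_float64
    + nnz * bytes_per_index
    + (n_rows + 1) * bytes_per_index
)

dense_tb = dense_bytes / 1e12
csr_gb = csr_bytes / 1e9
print(f"dense: {dense_tb:.1f} TB, CSR: {csr_gb:.1f} GB")
```

Under these assumptions the dense array needs a little over 22 TB, while the sparse representation stays under 2 GB.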
Upvotes: 1