Tim

Reputation: 2050

Numpy/scipy load huge sparse matrix to use in scikit-learn

I have a dataset of 40,000 rows and 5,000 columns of boolean values (1s and 0s) in a CSV file. I cannot load it into numpy because doing so throws a MemoryError.

I tried loading it into a sparse matrix, as described in the answer to this question: csv to sparse matrix in python

However, this format cannot be used in scikit-learn. Is there a way to read the CSV into a sparse matrix that scikit-learn can actually use?

Loading the matrix directly into numpy is done with:

matrix = np.loadtxt('data.csv', skiprows=1, delimiter=',')

Upvotes: 0

Views: 558

Answers (1)

rabbit

Reputation: 1476

The answer to the question you linked yields a lil_matrix. According to the scipy docs, you can call matrix.tocsr() to convert it to a csr_matrix, which should be usable in sklearn routines that accept sparse matrices. It would be more elegant to read your data directly into a csr_matrix, but for your dataset of boolean values this should work fine.
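For what it's worth, here is a rough sketch of the "read directly into a csr_matrix" route. It assumes the file looks like yours (one header row, comma-separated, cells that parse as numbers), and the name load_csv_as_csr is just something I made up for illustration, not a scipy or sklearn function. It streams the file row by row and only remembers the coordinates of the nonzero cells, so the dense array is never built.

```python
import csv

import numpy as np
from scipy.sparse import csr_matrix


def load_csv_as_csr(path, skip_header=True, delimiter=",", dtype=np.float64):
    """Read a numeric CSV into a csr_matrix without building a dense array.

    Hypothetical helper for illustration; assumes every non-empty cell
    parses as a number.
    """
    rows = []  # row index of each nonzero cell
    cols = []  # column index of each nonzero cell
    vals = []  # value of each nonzero cell
    n_rows = 0
    n_cols = 0

    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        if skip_header:
            next(reader, None)  # discard the header row, if present
        for i, row in enumerate(reader):
            n_rows = i + 1
            n_cols = max(n_cols, len(row))
            for j, cell in enumerate(row):
                cell = cell.strip()
                # Keep only cells that are non-empty and nonzero.
                if cell and float(cell) != 0:
                    rows.append(i)
                    cols.append(j)
                    vals.append(float(cell))

    data = np.asarray(vals, dtype=dtype)
    # Build the CSR matrix from (value, (row, col)) triplets.
    return csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))
```

You would then call something like X = load_csv_as_csr('data.csv') and pass X to an estimator that accepts sparse input. If you already have the lil_matrix from the other answer, X = matrix.tocsr() gets you to the same format. Note that this only saves memory if most of your values are 0; for a mostly-1s matrix the coordinate lists can end up larger than the dense data.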

Upvotes: 1
