Reputation: 2050
I have a dataset of 40,000 rows and 5000 columns of boolean values (1s and 0s) in a csv file. I cannot load this into numpy because it throws a MemoryError.
I tried loading it into a sparse matrix as was answered in this question: csv to sparse matrix in python
However, this format cannot be used in scikit-learn. Is there a way to read the csv into a sparse matrix that can in fact be used by scikit-learn?
Loading in the matrix directly to numpy is done by:
import numpy as np
matrix = np.loadtxt('data.csv', skiprows=1, delimiter=',')
Upvotes: 0
Views: 558
Reputation: 1476
The answer in the question you linked yields a lil_matrix. According to the scipy docs, you can call matrix.tocsr() to turn it into a csr_matrix. This should be usable in sklearn routines that accept sparse matrices. It would be more elegant to read your data directly into a csr_matrix, but for your dataset of boolean values, this should work alright.
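If you do want to skip the lil_matrix step, here is a rough sketch of reading the file straight into a csr_matrix one row at a time, so the dense array is never built. The function name read_csv_as_csr and the in-memory sample data are made up for illustration; it assumes every cell parses as a number and that there is one header row, matching the skiprows=1 in the question.

```python
import csv
import io

import numpy as np
from scipy.sparse import csr_matrix


def read_csv_as_csr(lines, skip_header=True):
    """Build a CSR matrix from an iterable of CSV lines holding 0/1 values.

    Hypothetical helper, not part of scipy or scikit-learn.
    """
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # drop the header row, like skiprows=1
    indices = []  # column index of every nonzero entry, row after row
    indptr = [0]  # indices[indptr[i]:indptr[i + 1]] belongs to row i
    n_cols = 0
    for row in reader:
        if not row:
            continue  # ignore blank lines
        n_cols = max(n_cols, len(row))
        indices.extend(j for j, value in enumerate(row) if float(value) != 0)
        indptr.append(len(indices))
    # Every stored entry is 1, so a small integer dtype keeps memory low.
    data = np.ones(len(indices), dtype=np.int8)
    return csr_matrix(
        (data, np.asarray(indices, dtype=np.int32), np.asarray(indptr, dtype=np.int32)),
        shape=(len(indptr) - 1, n_cols),
    )


# Tiny in-memory stand-in for data.csv (assumed layout: header, then 0/1 rows).
sample = io.StringIO("a,b,c\n1,0,0\n0,0,1\n1,1,0\n")
X = read_csv_as_csr(sample)
print(X.shape, X.nnz)
```

For the real file you would pass an open file handle instead, e.g. with open('data.csv', newline='') as f: X = read_csv_as_csr(f). Only the positions of the 1s are stored, so memory use scales with the number of nonzero entries rather than with 40,000 x 5000.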
Upvotes: 1