Reputation: 233
I have some data, around 400 million rows, and some features are categorical. I apply pandas.get_dummies
to do one-hot encoding, and I have to use the sparse=True
option because the data is fairly large (otherwise exceptions/errors are raised).
result = result.drop(["time", "Ds"], axis=1)
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)
Then I get a sparse DataFrame (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values
into sklearn,
train_data = result_encoded.drop(['count'], axis=1).values
but it raised the error "array is too big". Then I fed the sparse DataFrame to sklearn directly, and a similar error message showed up again.
train_data = result_encoded.drop(['count'], axis=1)
Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?
Upvotes: 2
Views: 1523
Reputation: 3497
You should be able to use the experimental .to_coo()
method in pandas [1] in the following way:
# Stack into a Series with a (row, column) MultiIndex, make it sparse, export to COO
result_encoded, idx_rows, idx_cols = result_encoded.stack().to_sparse().to_coo()
# COO is easy to build but slow to operate on; convert to CSR for sklearn
result_encoded = result_encoded.tocsr()
Instead of taking a DataFrame
(rows / columns), this method takes a Series
with the rows and columns in a MultiIndex
(this is why you need the .stack()
method). This Series
with the MultiIndex
needs to be a SparseSeries
, and even if your input is a SparseDataFrame
, .stack()
returns a regular Series
. So, you need to use the .to_sparse()
method before calling .to_coo()
.
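For a concrete picture of what the chain does, here is a minimal sketch on a toy frame (made-up data; it assumes an older pandas, pre-1.0, where SparseSeries still exists):
import numpy as np
import pandas as pd

# Tiny stand-in for the one-hot encoded data
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

stacked = df.stack()                 # regular Series with a (row, column) MultiIndex
sparse_stacked = stacked.to_sparse() # SparseSeries, required before .to_coo()

# Level 0 of the MultiIndex becomes matrix rows, level 1 becomes columns
coo, idx_rows, idx_cols = sparse_stacked.to_coo(row_levels=[0], column_levels=[1])
print(coo.toarray())                 # [[1. 0.] [0. 2.]]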
The Series
returned by .stack()
, even though it's not a SparseSeries
, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan
as the missing marker when the dtype is np.float
).
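A quick way to check the null-dropping behaviour yourself (again with made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": [np.nan, np.nan, 2.0]})
stacked = df.stack()                  # NaN cells are dropped during stacking
print(len(stacked))                   # 2, one entry per non-null cell
print(int(df.notnull().sum().sum()))  # 2 as well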
In general, you'll want the more efficient CSR
or CSC
format for your sparse scipy array, instead of the simpler COO
, so you can convert it with the .tocsr()
method.
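Once you have the CSR matrix, you can pass it to sklearn directly, since its ridge regression accepts scipy sparse input. A minimal sketch with placeholder data (X_csr and y stand in for your converted features and the count column):
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Hypothetical stand-ins for the real features and target
X_csr = sparse.random(100, 20, density=0.05, format="csr", random_state=0)
y = np.random.RandomState(0).rand(100)

model = Ridge(alpha=1.0)
model.fit(X_csr, y)       # Ridge handles scipy sparse matrices without densifying
print(model.coef_.shape)  # (20,)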
Upvotes: 2