user3162587

Reputation: 233

Convert pandas sparse dataframe to sparse numpy matrix for sklearn use?

I have some data, around 400 million rows, and some of the features are categorical. I apply pandas.get_dummies to do one-hot encoding, and I have to use the sparse=True option because the data is fairly large (otherwise exceptions/errors are raised).

result = result.drop(["time", "Ds"], axis=1)
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)

Then I get a sparse dataframe (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values into sklearn,

train_data = result_encoded.drop(['count'], axis=1).values

but it raised the error "array is too big". Then I fed the sparse dataframe itself to sklearn, and a similar error message appeared.

train_data = result_encoded.drop(['count'], axis=1)

Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?

Upvotes: 2

Views: 1523

Answers (1)

Marc Garcia

Reputation: 3497

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

result_encoded, idx_rows, idx_cols = result_encoded.stack().to_sparse().to_coo()
result_encoded = result_encoded.tocsr()

Instead of taking a DataFrame (rows / columns), this method takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). The Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So you need to call .to_sparse() before calling .to_coo().

The Series returned by .stack(), even though it is not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
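For readers on a newer pandas, where SparseDataFrame and .to_sparse() are no longer available, the same conversion can be sketched with the DataFrame.sparse accessor. This is a minimal sketch, not the method from the answer above: it assumes a pandas version that provides the .sparse accessor, and the toy values are made up (only the column names come from the question).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's data; column names mirror the question,
# the values are invented for illustration.
df = pd.DataFrame({
    "id1": ["a", "b", "a", "c"],
    "id2": ["x", "x", "y", "y"],
    "count": [1.0, 2.0, 3.0, 4.0],
})

# Separate the target first so that every remaining column is sparse;
# the .sparse accessor on a DataFrame requires all columns to be sparse.
y = df["count"].to_numpy()
X_df = pd.get_dummies(df[["id1", "id2"]], columns=["id1", "id2"],
                      sparse=True, dtype=float)

# DataFrame.sparse.to_coo() returns a scipy.sparse COO matrix;
# convert it to CSR for row slicing and fast arithmetic.
X = X_df.sparse.to_coo().tocsr()
print(X.shape, X.nnz)
```

Keeping the dense target column out of the encoded frame matters here, because the accessor is not available when any column is dense.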

In general, you'll want the more efficient CSR or CSC format for your sparse scipy matrix instead of the simpler COO, so you can convert it with the .tocsr() method.
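As a rough illustration of the last step, the CSR matrix can be passed to sklearn's Ridge as it is. The small hand-built matrix below is only a placeholder for the real one-hot encoded features, and alpha=1.0 is an arbitrary choice, not a recommendation.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Small COO matrix standing in for the one-hot encoded features
# (4 rows, 5 dummy columns, two non-zero entries per row).
rows = np.array([0, 0, 1, 1, 2, 2, 3, 3])
cols = np.array([0, 3, 1, 3, 0, 4, 2, 4])
vals = np.ones(8)
X_coo = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 5))

# CSR supports efficient row slicing and matrix-vector products.
X = X_coo.tocsr()
y = np.array([1.0, 2.0, 3.0, 4.0])

# Ridge accepts scipy sparse matrices as input.
model = Ridge(alpha=1.0)
model.fit(X, y)
pred = model.predict(X)
```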

  1. http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse

Upvotes: 2
