Reputation: 233
I have some data, around 400 million rows, and some features are categorical. I apply pandas.get_dummies
to do one-hot encoding, and I have to use the sparse=True
option because the data is fairly large (otherwise exceptions/errors are raised).
result = result.drop(["time", "Ds"], axis=1)
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)
Then I get a sparse DataFrame (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values
into sklearn,
train_data = result_encoded.drop(['count'], axis=1).values
but it raised the error "array is too big". Then I fed the sparse DataFrame to sklearn directly, and a similar error message showed up again.
train_data = result_encoded.drop(['count'], axis=1)
Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?
Upvotes: 2
Views: 1523
Reputation: 3497
You should be able to use the experimental .to_coo()
method in pandas [1] in the following way:
# Stack into a Series with a (row, column) MultiIndex, make it sparse, export to COO
result_encoded, idx_rows, idx_cols = result_encoded.stack().to_sparse().to_coo()
# COO is easy to build but slow to operate on; convert to CSR for sklearn
result_encoded = result_encoded.tocsr()
Instead of taking a DataFrame
(rows / columns), this method takes a Series
with the rows and columns in a MultiIndex
(this is why you need the .stack()
method). This Series
with the MultiIndex
needs to be a SparseSeries
, and even if your input is a SparseDataFrame
, .stack()
returns a regular Series
. So, you need to use the .to_sparse()
method before calling .to_coo()
.
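For a concrete picture of what the chain does, here is a minimal sketch on a toy frame (made-up data; it assumes an older pandas, pre-1.0, where SparseSeries still exists):
import numpy as np
import pandas as pd

# Tiny stand-in for the one-hot encoded data
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

stacked = df.stack()                 # regular Series with a (row, column) MultiIndex
sparse_stacked = stacked.to_sparse() # SparseSeries, required before .to_coo()

# Level 0 of the MultiIndex becomes matrix rows, level 1 becomes columns
coo, idx_rows, idx_cols = sparse_stacked.to_coo(row_levels=[0], column_levels=[1])
print(coo.toarray())                 # [[1. 0.] [0. 2.]]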
The Series
returned by .stack()
, even though it's not a SparseSeries
, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan
as the missing marker when the dtype is np.float
).
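A quick way to check the null-dropping behaviour yourself (again with made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": [np.nan, np.nan, 2.0]})
stacked = df.stack()                  # NaN cells are dropped during stacking
print(len(stacked))                   # 2, one entry per non-null cell
print(int(df.notnull().sum().sum()))  # 2 as well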
In general, you'll want the more efficient CSR
or CSC
format for your sparse scipy array, instead of the simpler COO
, so you can convert it with the .tocsr()
method.
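Once you have the CSR matrix, you can pass it to sklearn directly, since its ridge regression accepts scipy sparse input. A minimal sketch with placeholder data (X_csr and y stand in for your converted features and the count column):
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Hypothetical stand-ins for the real features and target
X_csr = sparse.random(100, 20, density=0.05, format="csr", random_state=0)
y = np.random.RandomState(0).rand(100)

model = Ridge(alpha=1.0)
model.fit(X_csr, y)       # Ridge handles scipy sparse matrices without densifying
print(model.coef_.shape)  # (20,)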
Upvotes: 2