Reputation: 127
I am trying to use statsmodels to fit a logistic regression model (Logit) to my data, but my data is in a Dask dataframe rather than a pandas dataframe.
This is a sample of my dataset, smarket_1:
Response variable: Direction
const Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
0 1.0 2001.0 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 1.0
1 1.0 2001.0 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 1.0
2 1.0 2001.0 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 0.0
3 1.0 2001.0 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 1.0
4 1.0 2001.0 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 1.0
5 1.0 2001.0 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 1.0
6 1.0 2001.0 1.392 0.213 0.614 -0.623 1.032 1.4450 -0.403 0.0
7 1.0 2001.0 -0.403 1.392 0.213 0.614 -0.623 1.4078 0.027 1.0
8 1.0 2001.0 0.027 -0.403 1.392 0.213 0.614 1.1640 1.303 1.0
9 1.0 2001.0 1.303 0.027 -0.403 1.392 0.213 1.2326 0.287 1.0
10 1.0 2001.0 0.287 1.303 0.027 -0.403 1.392 1.3090 -0.498 0.0
11 1.0 2001.0 -0.498 0.287 1.303 0.027 -0.403 1.2580 -0.189 0.0
12 1.0 2001.0 -0.189 -0.498 0.287 1.303 0.027 1.0980 0.680 1.0
13 1.0 2001.0 0.680 -0.189 -0.498 0.287 1.303 1.0531 0.701 1.0
14 1.0 2001.0 0.701 0.680 -0.189 -0.498 0.287 1.1498 -0.562 0.0
15 1.0 2001.0 -0.562 0.701 0.680 -0.189 -0.498 1.2953 0.546 1.0
16 1.0 2001.0 0.546 -0.562 0.701 0.680 -0.189 1.1188 -1.747 0.0
17 1.0 2001.0 -1.747 0.546 -0.562 0.701 0.680 1.0484 0.359 1.0
18 1.0 2001.0 0.359 -1.747 0.546 -0.562 0.701 1.0130 -0.151 0.0
19 1.0 2001.0 -0.151 0.359 -1.747 0.546 -0.562 1.0596 -0.841 0.0
So, when I use the Logit class from statsmodels and fit my data:
from statsmodels.api import Logit
logistic_reg = Logit(endog=smarket_1['Direction'], exog=smarket_1.drop(labels='Direction', axis=1)).fit()
logistic_reg.summary()
I get the following error:
ValueError: unrecognized data structures: <class 'dask.dataframe.core.DataFrame'> / <class 'dask.dataframe.core.DataFrame'>
Next, when I tried converting the Dask dataframe to a pandas one using .compute() as follows:
from statsmodels.api import Logit
logistic_reg = Logit(endog=smarket_1['Direction'], exog=smarket_1.drop(labels='Direction', axis=1).compute()).fit()
I get this error:
AttributeError: 'Index' object has no attribute 'equals'
However, when I passed the same Dask dataframe to sklearn's Logistic Regression model, it worked without any error.
So does statsmodels not support or work with Dask dataframes?
Upvotes: 1
Views: 384
Reputation: 15432
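No, you can't use scikit-learn or statsmodels directly on Dask arrays or dataframes. These libraries are built on NumPy data structures and have no support for out-of-core or delayed operations. scikit-learn may appear to work because it converts its input to an in-memory NumPy array, which gives up Dask's out-of-core and parallel execution.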
Instead, use dask-ml, which is part of the Dask ecosystem, works directly with these data structures, and is designed to resemble those NumPy-based frameworks while running on the Dask scheduler.
Note that some algorithms you may be working with do not scale well (or at all) to parallel or partitioned datasets. Dask-ml has implemented a number of algorithms which are similar, but use approximation or sampling methods to achieve similar (but not identical) results. So be prepared to read up on the available methods and to be flexible in your need for exact solutions. Otherwise, your only option is to use a machine with more memory and compute the collection so you can use the numpy-based libraries.
Upvotes: 2