Frank B.

Reputation: 1873

Dask get_dummies Does Not Transform Variable(s)

I'm trying to use get_dummies via dask but it does not transform my variable, nor does it error out:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical) 
>>> daskDataDummies.head()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
>>> daskDataDummies.compute() 
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
5      F
6      M
7      F
8      M
9      F
>>>

The pandas equivalent (run in a new terminal just in case) is:

>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
   gender_F  gender_M
0       0.0       1.0
1       0.0       0.0
2       0.0       0.0
3       1.0       0.0
4       0.0       0.0
>>> 

My understanding of this resolved issue is that this should work. Does the data have to be pulled into pandas first? If so, that defeats the purpose of using dask, since my datasets (~500GB) won't fit into a pandas DataFrame. Am I misreading this? TIA.

Upvotes: 5

Views: 9833

Answers (1)

TomAugspurger

Reputation: 28946

You'll want to convert your column of strings to a Categorical before trying to use get_dummies. This pull request added a dask.dataframe.get_dummies, which will error if you try to pass object (string) columns, unlike pd.get_dummies.

To get a Categorical, you can either call .categorize before dd.get_dummies or, with pandas >= 0.19, read in your CSV with the dtype keyword, like

df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})

Here's a small example:

In [2]: import dask.dataframe as dd

In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)

In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[4]:
   A
0  a
1  b
2  a

In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)

/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
     68         if columns is None:
     69             if (data.dtypes == 'object').any():
---> 70                 raise NotImplementedError(not_cat_msg)
     71             columns = data._meta.select_dtypes(include=['category']).columns
     72         else:

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.

In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
   A_a  A_b  A_c
0    1    0    0
1    0    1    0
2    1    0    0
3    0    1    0
4    0    0    1

Dask requires categoricals for get_dummies because it needs to know, before looking at the data, every dummy variable it will have to create. pandas doesn't have to worry about this since all of your data is already in memory.
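To see why this matters, here is a rough pandas-only sketch that mimics two partitions by hand. The two small frames and their values are hypothetical; the point is only that encoding each chunk independently can produce different columns, while a shared categorical dtype keeps them aligned.

```python
import pandas as pd

# Two hand-made "partitions" of the same logical column (illustrative data).
part_1 = pd.DataFrame({"gender": ["M", "M"]})
part_2 = pd.DataFrame({"gender": ["F", "M"]})

# Encoding plain strings per partition: each chunk only sees its own values.
cols_1 = list(pd.get_dummies(part_1).columns)
cols_2 = list(pd.get_dummies(part_2).columns)
print(cols_1)  # the first chunk never saw "F"
print(cols_2)

# With a shared categorical dtype, every chunk yields the same columns.
gender_dtype = pd.CategoricalDtype(categories=["F", "M"])
cat_cols_1 = list(pd.get_dummies(part_1.astype({"gender": gender_dtype})).columns)
cat_cols_2 = list(pd.get_dummies(part_2.astype({"gender": gender_dtype})).columns)
print(cat_cols_1)
print(cat_cols_2)
```

Without the shared dtype, the per-partition results could not be concatenated into one consistent frame, which is the situation dask avoids by insisting on categoricals.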

Upvotes: 7
