Reputation: 1873
I'm trying to use get_dummies via dask, but it neither transforms my variable nor raises an error:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
uid gender
0 1 M
1 2 NaN
2 3 NaN
3 4 F
4 5 NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical)
>>> daskDataDummies.head()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
>>> daskDataDummies.compute()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
5 F
6 M
7 F
8 M
9 F
>>>
The pandas equivalent (run in a new terminal, just in case) is:
>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
uid gender
0 1 M
1 2 NaN
2 3 NaN
3 4 F
4 5 NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
gender_F gender_M
0 0.0 1.0
1 0.0 0.0
2 0.0 0.0
3 1.0 0.0
4 0.0 0.0
>>>
My understanding of this resolved issue is that it should work. Does the data have to be pulled into pandas first? If so, that defeats the purpose of using dask, since my datasets (~500GB) won't fit into a pandas dataframe. Am I misreading this? TIA.
Upvotes: 5
Views: 9833
Reputation: 28946
You'll want to convert your column of strings to a Categorical before calling get_dummies. This pull request added dask.dataframe.get_dummies, which, unlike pd.get_dummies, raises an error if you pass it object (string) columns.
To get a Categorical, you can either call .categorize before dd.get_dummies or, with pandas >= 0.19, read in your CSV with the dtype keyword, like this:
df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})
Here's a small example:
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)
In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
warnings.warn(msg.format(n, len(r)))
Out[4]:
A
0 a
1 b
2 a
In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
68 if columns is None:
69 if (data.dtypes == 'object').any():
---> 70 raise NotImplementedError(not_cat_msg)
71 columns = data._meta.select_dtypes(include=['category']).columns
72 else:
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
A_a A_b A_c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
Dask requires categoricals for get_dummies because it has to know up front every dummy variable (output column) it will create. pandas doesn't have to worry about this, since all of your data is already in memory.
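To see why the full category list matters, here is a toy sketch in plain Python (no dask involved). It splits the example column above into two partitions, the way npartitions=2 did. The one_hot helper is hypothetical and only illustrates the idea; it is not how dask implements get_dummies.

```python
# The example column ['a', 'b', 'a', 'b', 'c'] split into two partitions.
partitions = [["a", "b", "a"], ["b", "c"]]


def one_hot(values, categories):
    """Return one row of 0/1 flags per value, one flag per category."""
    return [[int(v == c) for c in categories] for v in values]


# If each partition only looks at its own values, the columns disagree:
# the first partition never sees 'c', the second never sees 'a'.
local_columns = [sorted(set(p)) for p in partitions]

# With the full category list known up front, every partition
# produces the same columns in the same order.
categories = sorted({v for p in partitions for v in p})
shared = [one_hot(p, categories) for p in partitions]

print(local_columns)
print(categories)
print(shared)
```

Each partition can then be encoded independently and the pieces still line up, which is what lets the work proceed without loading everything into memory at once.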
Upvotes: 7