davidvera

Reputation: 1489

Dask: NotImplementedError: `df.column.cat.codes` with unknown categories is not supported

I use this code to add a product id column to a dataframe:

df = df.assign(id=(df['PROD_NAME']).astype('category').cat.codes)

This works fine with pandas: it gives each distinct PROD_NAME value its own id. However, I want to use Dask, which lets me work with several clients and avoid memory issues.

With Dask, I get the following error message:

NotImplementedError: `df.column.cat.codes` with unknown categories is not supported.  Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories

How can I create this new column in Dask?

Upvotes: 2

Views: 648

Answers (1)

AlexK

Reputation: 3011

This is an old post, but being the first that comes up when searching for this error, it could use an answer:

TL;DR:

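Run this sequence on your Dask dataframe (the `.astype("category")` call is only needed if `PROD_NAME` is not already categorical, as in the question):

ddf["PROD_NAME"] = ddf["PROD_NAME"].astype("category").cat.as_known()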
ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
out_df = ddf.compute()

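Per Dask's documentation, a categorical column in Dask is either a "known categorical" (its categories are stored in the dataframe's metadata) or an "unknown categorical" (they are not), and you can convert between the two. `.cat.codes` needs "known" categories because it has to pull the category mapping from that column metadata; `.cat.as_known()` computes the categories if they are not known yet.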

import pandas as pd
from dask import dataframe as dd

# Show the pandas workflow
>>> d = pd.Series(['A','B','D'], dtype='category').to_frame(name="PROD_NAME")
>>> d = d.assign(id=(d["PROD_NAME"]).astype('category').cat.codes)
>>> d
   PROD_NAME  id
0          A   0
1          B   1
2          D   2

# Now, in Dask:
>>> ddf = dd.from_pandas(d, npartitions=1)
>>> ddf
Dask DataFrame Structure:
                     PROD_NAME
npartitions=1                 
0              category[known]
2                          ...
Dask Name: from_pandas, 1 tasks

# The conversion to a Dask dataframe already created a "known categorical", but
# let's convert it to "unknown" (note that no .compute() is needed here):
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_unknown()
>>> ddf
Dask DataFrame Structure:
                       PROD_NAME
npartitions=1                   
0              category[unknown]
2                            ...
Dask Name: assign, 3 tasks

# Now, let's convert it back to "known", then create the new column using .assign()
# and call .compute() to create output dataframe:
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
>>> ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
>>> out_df = ddf.compute()
>>> out_df
  PROD_NAME  id
0         A   0
1         B   1
2         D   2

Upvotes: 2
