edesz

Reputation: 12406

Dask dataframe has no attribute categorize

I am trying to store a Dask dataframe with a categorical column in an HDF5 (*.h5) file, following this tutorial (1:23:25 to 1:23:45).

Here is my call to a store function:

stored = store(ddf,'/home/HdPC/Analyzed.h5', ['Tag'])

The function store is:

@delayed
def store(ddf,fp,c):
    ddf.categorize(columns=c).to_hdf(fp, '/data2')

and uses categorize.

ddf and stored are of type:

print(type(ddf), type(stored))
>>> (<class 'dask.dataframe.core.DataFrame'>, <class 'dask.delayed.Delayed'>)

When I run compute(*[stored]) or stored.compute(), I get this:

dask.async.AttributeError: 'DataFrame' object has no attribute 'categorize'

Is there a way to achieve this categorization of the Tag column with the store function? Or should I use a different method to store the Dask dataframe with a categorical?

Upvotes: 1

Views: 1127

Answers (1)

mdurant

Reputation: 28683

I would suggest trying the dataframe operations without the delayed call: Dask dataframes are already lazy compute graphs internally. I believe that when you pass a Dask dataframe to a delayed function and call compute, Dask computes the dataframe first and hands the resulting pandas dataframe to your function. A pandas dataframe has no categorize method, which is why you get the error.

In your case, simply remove @delayed, remembering that to_hdf is a blocking call, so the file is written as soon as store is called.

Upvotes: 2
