Stanko
Stanko

Reputation: 4475

Why does Dask fill in "foo" and 1 in my Dataframe

I've read in around 15 csv files:

df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
                 storage_options={'token': fs.session.credentials})

Then I persisted the Dataframe (it uses 7.33 GB memory):

df = df.persist()

I set a new index because I want my group by on that field to be as efficient as possible:

df = df.set_index('column_a').persist()

Now I have 181 divisions and 180 partitions. To try out how fast my group by was going I tried a custom apply function that just prints the Group Dataframe:

grouped_by_index = df.groupby('column_a').apply(lambda n: print(n)).compute()

That printed a Dataframe with correct columns but the values are either "1", "foo" or "True". Example:

column_b  column_c column_d  column_e  column_f  column_g  \
index                                                                   
a          foo           1      foo        1           1           1

I also get the warning:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: meta is not specified, inferred from partial data. Please provide meta if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8'))
for series result """Entry point for launching an IPython kernel.

What is going on here?

Upvotes: 7

Views: 2324

Answers (1)

mdurant
mdurant

Reputation: 28684

Indeed, if you read the docs for apply, you will see that meta= is a parameter that you can pass, which tells Dask how to expect the output of the operation to look. This is necessary because apply can do very general things.

If you don't supply meta=, as in your case, than Dask will try to seed the operation with an example mini-dataframe containing 1 for any numerical columns and "foo" for text ones, just to see what the output will be like. Since in your apply you print (and don't actually return anything), you are seeing this seed.

As suggested by the documentation, you are always better off providing meta= when possible, and then a whole step in the process can be avoided.

Upvotes: 7

Related Questions