Reputation: 4475
I've read in around 15 CSV files:
df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
storage_options={'token': fs.session.credentials})
Then I persisted the DataFrame (it uses 7.33 GB of memory):
df = df.persist()
I set a new index because I want my groupby on that field to be as efficient as possible:
df = df.set_index('column_a').persist()
Now I have 181 divisions and 180 partitions. To see how fast my groupby would be, I tried a custom apply function that just prints each group DataFrame:
grouped_by_index = df.groupby('column_a').apply(lambda n: print(n)).compute()
That printed a DataFrame with the correct columns, but the values are all either 1, "foo", or True. Example:
column_b column_c column_d column_e column_f column_g \
index
a foo 1 foo 1 1 1
I also get the warning:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning:
meta is not specified, inferred from partial data. Please provide meta if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8')) for series result
  """Entry point for launching an IPython kernel.
What is going on here?
Upvotes: 7
Views: 2324
Reputation: 28684
Indeed, if you read the docs for apply, you will see that meta= is a parameter you can pass which tells Dask what to expect the output of the operation to look like. This is necessary because apply can do very general things.
If you don't supply meta=, as in your case, then Dask will seed the operation with an example mini-dataframe containing 1 for any numerical columns and "foo" for text ones, just to see what the output will look like. Since your apply only prints (and doesn't actually return anything), you are seeing this seed.
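To make that concrete, here is a rough pandas-only sketch of the kind of placeholder frame Dask builds when meta= is missing. The function name fake_seed_frame and the exact fill rules are my illustration of the idea, not Dask's actual internals:

```python
import numpy as np
import pandas as pd

def fake_seed_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: build a one-row frame of placeholder values from df's dtypes."""
    placeholders = {}
    for col, dtype in df.dtypes.items():
        if dtype == np.bool_:
            placeholders[col] = [True]       # booleans become True
        elif np.issubdtype(dtype, np.number):
            placeholders[col] = [1]          # numeric columns become 1
        else:
            placeholders[col] = ["foo"]      # text/object columns become "foo"
    # cast back so each column keeps its original dtype
    return pd.DataFrame(placeholders).astype(df.dtypes)

real = pd.DataFrame({"column_b": ["x"], "column_c": [3.5], "column_d": [True]})
print(fake_seed_frame(real))  # one row of placeholders: foo / 1.0 / True
```

Running your lambda on a frame like this is how the "foo"/1/True values end up in your printed output.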
As the documentation suggests, you are always better off providing meta= when possible, so that this whole inference step can be avoided.
Upvotes: 7