Reputation: 7730
I am trying to understand what map_partitions in dask does. Here is my example:
import dask.dataframe as dd
import pandas as pd
from dask.multiprocessing import get
import random
df = pd.DataFrame({'col_1':random.sample(range(10000), 100), 'col_2': random.sample(range(10000), 100) })
def test_f(df):
    print(df.col_1)
    print("------------")

ddf = dd.from_pandas(df, npartitions=8)
ddf['result'] = ddf.map_partitions(test_f).compute(get=get)
And here is the output:
0 1.0
1 1.0
Name: col_1, dtype: float64
------------
Why don't I get a full printout of my dataframe? What does this output mean?
Upvotes: 2
Views: 2839
Reputation: 28673
map_partitions
takes an optional meta=
keyword, with which you can tell Dask how you expect the output of your function to look. This is generally a good idea, since it saves Dask from having to infer the output's structure, which can require a non-trivial amount of extra work.
In the absence of meta=
, Dask will first call your function on a small dummy dataframe to infer the output, and then once for each partition. You are seeing the first of these calls. If you provide any meta=, you will only see the per-partition calls. Naturally, you would want to provide the actual expected output template; in your case, however, the function doesn't return anything at all.
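As a sketch of what that could look like (assuming, purely for illustration, that you want the per-row sum of the two columns; the function name, column name and meta spec are just examples):

def sum_cols(df):
    # runs once per partition; returns a Series aligned with the partition's index
    return df.col_1 + df.col_2

# meta=('result', 'f8') tells Dask the output is a float64 Series named 'result',
# so no dummy call is needed to infer it
ddf['result'] = ddf.map_partitions(sum_cols, meta=('result', 'f8'))
print(ddf.compute())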
In order to avoid doing too much work just for inference, Dask uses typical dummy values. In this case, the value 1.0 is used for each float column, and more than one row is included so that the input looks like a dataframe rather than a series.
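If you want to see that dummy frame for yourself, you can peek at the dataframe's _meta_nonempty attribute (an internal attribute, so it may change between Dask versions):

# shows the tiny dummy frame Dask passes to your function during inference
print(ddf._meta_nonempty)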
Upvotes: 5