user1700890

Reputation: 7730

Understanding what map_partitions in dask does

I am trying to understand what map_partitions in dask does. Here is my example:

import dask.dataframe as dd
import pandas as pd
import random

df = pd.DataFrame({'col_1': random.sample(range(10000), 100),
                   'col_2': random.sample(range(10000), 100)})

def test_f(df):
    print(df.col_1)
    print("------------")

ddf = dd.from_pandas(df, npartitions=8)

# scheduler='processes' is the modern spelling of the old get= keyword
ddf['result'] = ddf.map_partitions(test_f).compute(scheduler='processes')

And here is output:

0    1.0
1    1.0
Name: col_1, dtype: float64
------------

Why don't I get the full printout of my dataframe? What does this output mean?

Upvotes: 2

Views: 2839

Answers (1)

mdurant

Reputation: 28673

map_partitions takes an optional meta= keyword, with which you can tell Dask what the output of your function will look like. This is generally a good idea, since it saves Dask from having to infer the output's structure, which can require significant extra work.

In the absence of meta=, Dask will call your function once up front to infer the output, and then once for each partition. You are seeing the first of these calls. If you provide any meta=, you will only see the partitions. Obviously you'd want to provide the actual expected output template; but in your case the function doesn't actually return anything.

To avoid doing too much work just for inference, Dask uses typical dummy values. In this case, for each float column the value 1.0 is used, and there is more than one row, to ensure the input looks like a dataframe as opposed to a series.

Upvotes: 5
