Arco Bast
Arco Bast

Reputation: 3892

What is the return value of map_partitions?

The dask API says, that map_partition can be used to "apply a Python function on each DataFrame partition." From this description and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.

However, with respect to the following code, I am not sure, what the return value depends on:

#generate example dataframe
pdf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)

#define helper function for map. VAL is the return value
VAL = pd.Series({'A': 1})
#VAL = pd.DataFrame({'A': [1]}) #other return values used in this example
#VAL = None
#VAL = 1
def helper(x):
    print('function called\n')
    return VAL

#check result
out = ddf.map_partitions(helper).compute()
print(len(out))

Therefore, I want to ask some questions:

  1. how is the return value of map_partitions determined?
  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?
  3. What should be the return value of a function, that only "does" something, i.e. a procedure?
  4. How should a function be designed, that returns arbitrary objects?

Upvotes: 8

Views: 5619

Answers (1)

MRocklin
MRocklin

Reputation: 57261

The Dask DataFrame.map_partitions function returns a new Dask Dataframe or Series, based on the output type of the mapped function. See the API documentation for a thorough explanation.

  1. How is the return value of map_partitions determined?

    See the API docs referred to above.

  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?

    You're correct that we're calling it once immediately to guess the dtypes/columns of the output. You can avoid this by specifying a meta= keyword directly. Other than that the function is called once per partition.

  3. What should be the return value of a function, that only "does" something, i.e. a procedure?

    You could always return an empty dataframe. You might also want to consider transforming your dataframe into a sequence of dask.delayed objects, which are typically more often used for ad-hoc computations.

  4. How should a function be designed, that returns arbitrary objects?

    If your function doesn't return series/dataframes then I recommend converting your dataframe to a sequence of dask.delayed objects with the DataFrame.to_delayed method.

Upvotes: 5

Related Questions