Reputation: 7387
I want to apply a mapping to a DataFrame column. With Pandas this is straightforward:
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
This writes the infos column by applying the custom_map function to each value in the numbers column via the lambda.
With dask this isn't as simple. ddf is a dask DataFrame. map_partitions applies a function to each partition of the DataFrame in parallel, but the following does not work, because you don't define columns like that in dask:
ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))
Does anyone know how I can work with columns here? I don't understand the API documentation at all.
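For reference, here is a minimal runnable version of the Pandas pattern above, with hypothetical stand-ins for custom_map and hashmap (the real ones are not shown in the question):

```python
import pandas as pd

# Hypothetical stand-ins for the question's custom_map and hashmap
hashmap = {1: "one", 2: "two", 3: "three"}

def custom_map(nr, mapping):
    # Look up the number, falling back to a default label
    return mapping.get(nr, "unknown")

df2 = pd.DataFrame({"numbers": [1, 2, 99]})
df = pd.DataFrame(index=df2.index)

# Apply the mapping element-wise, exactly as in the snippet above
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
print(df["infos"].tolist())  # ['one', 'two', 'unknown']
```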
Upvotes: 7
Views: 15636
Reputation: 57271
You can use the .map method, exactly as in Pandas:
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'x': [1, 2, 3]})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0 2
1 3
2 4
Name: x, dtype: int64
In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0 2
1 3
2 4
Name: x, dtype: int64
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and
column names of the output. This metadata is necessary for many
algorithms in dask dataframe to work. For ease of use, some
alternative inputs are also available. Instead of a DataFrame,
a dict of {name: dtype} or iterable of (name, dtype) can be
provided. Instead of a series, a tuple of (name, dtype) can be
used. If not provided, dask will try to infer the metadata.
This may lead to unexpected results, so providing meta is
recommended.
For more information, see dask.dataframe.utils.make_meta.
So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given then dask.dataframe will try running your function on a little piece of data. It will raise an error asking for help if this fails.
Upvotes: 18