Reputation: 2123
In pandas, when I want to create a conditional dummy column (say, 1 if a variable equals a given string and 0 otherwise), my go-to is:
data["ebt_dummy"] = np.where((data["paymenttypeid"]=='ebt'), 1, 0)
Naively trying this on a Dask dataframe throws an error. Following the directions in the documentation for map_partitions also throws an error:
data = data.map_partitions(lambda df: df.assign(ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)), meta={'paymenttypeid': 'str', 'ebt_dummy': 'i8'})
What is a good way, or the most Dask-thonic way, of doing this?
Upvotes: 3
Views: 449
Reputation: 1
I believe what you're looking for is a ternary operation. For numerics, something like this should work.
import dask.dataframe as dd
import typing as t
def ternary(conditional: dd.Series, option_true: t.Union[float, int], option_false: t.Union[float, int]) -> dd.Series:
    # A boolean mask times a number keeps the number where True and gives 0 where False
    return conditional * option_true + (~conditional) * option_false
data["ebt_dummy"] = ternary(data["paymenttypeid"]=='ebt', 1, 0)
Upvotes: 0
Reputation: 2119
This also worked for me:
data['ebt_dummy'] = dd.from_array(np.where((df["paymenttypeid"]=='ebt'), 1, 0))
Upvotes: 0
Reputation: 11895
Here's some sample data to play with:
In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(np.transpose([np.random.choice(['ebt','other'], (10)),
                                np.random.rand(10)]), columns=['paymenttypeid','other'])
df
Out[1]:
  paymenttypeid                 other
0         other    0.3130770966143612
1         other    0.5167434068096931
2           ebt    0.7606898392115471
3           ebt    0.9424572692382547
4           ebt     0.624282017575857
5           ebt    0.8584841824784487
6         other    0.5017083765654611
7         other  0.025994123211164233
8           ebt   0.07045354449612984
9           ebt   0.11976351556850084
Let's convert this to a Dask dataframe:
In [2]: data = dd.from_pandas(df, npartitions=2)
and use apply (on a Series) to assign:
In [3]:
# meta describes the Series that apply returns, which holds integers here
data['ebt_dummy'] = data.paymenttypeid.apply(lambda x: 1 if x == 'ebt' else 0, meta=('paymenttypeid', 'int64'))
data.compute()
Out[3]:
  paymenttypeid                 other  ebt_dummy
0         other    0.3130770966143612          0
1         other    0.5167434068096931          0
2           ebt    0.7606898392115471          1
3           ebt    0.9424572692382547          1
4           ebt     0.624282017575857          1
5           ebt    0.8584841824784487          1
6         other    0.5017083765654611          0
7         other  0.025994123211164233          0
8           ebt   0.07045354449612984          1
9           ebt   0.11976351556850084          1
Update:
It seems that the meta you pass is the problem, since this works:
data = data.map_partitions(lambda df: df.assign(
    ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)))
data.compute()
In my example, if I wanted to specify the meta, I would have to describe every column of the frame that map_partitions returns, so the existing other column has to be listed alongside paymenttypeid and the new ebt_dummy:
data.map_partitions(lambda df: df.assign(
    ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)),
    meta={'paymenttypeid': 'str', 'other': 'float64', 'ebt_dummy': 'i8'})
Upvotes: 1