sfortney

Reputation: 2123

Dask + Pandas: Returning a sequence of conditional dummies

In pandas, if I want to create a column of conditional dummies (say, 1 if a variable equals a given string and 0 if it does not), my go-to is:

data["ebt_dummy"] = np.where((data["paymenttypeid"]=='ebt'), 1, 0)

Naively trying this on a Dask DataFrame throws an error. Following the directions in the documentation for map_partitions also throws an error:

data = data.map_partitions(
    lambda df: df.assign(ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0)),
    meta={'paymenttypeid': 'str', 'ebt_dummy': 'i8'})

What is a good way, or the most Dask-thonic way, of doing this?

Upvotes: 3

Views: 449

Answers (3)

J Levidy

Reputation: 1

I believe what you're looking for is a ternary operation. For numerics, something like this should work.

import dask.dataframe as dd
import typing as t

def ternary(conditional: dd.Series, option_true: t.Union[float, int],
            option_false: t.Union[float, int]) -> dd.Series:
    # Where the condition is True this yields 1*option_true + 0*option_false,
    # and the reverse where it is False.
    return conditional * option_true + (~conditional) * option_false

data["ebt_dummy"] = ternary(data["paymenttypeid"]=='ebt', 1, 0)

Upvotes: 0

tmsss

Reputation: 2119

This also worked for me:

import numpy as np
import dask.dataframe as dd

# Caveat: depending on the Dask/NumPy version, np.where may evaluate the
# comparison eagerly here rather than lazily.
data['ebt_dummy'] = dd.from_array(np.where(data["paymenttypeid"] == 'ebt', 1, 0))
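A fully lazy alternative (a sketch, not part of the original answer) is to skip NumPy altogether and cast the boolean mask directly, which keeps everything as Dask operations end to end:

data['ebt_dummy'] = (data['paymenttypeid'] == 'ebt').astype('int64')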

Upvotes: 0

Julien Marrec

Reputation: 11895

Here's some sample data to play with:

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.transpose([np.random.choice(['ebt', 'other'], 10),
                                np.random.rand(10)]),
                  columns=['paymenttypeid', 'other'])

df

Out[1]:

  paymenttypeid                 other
0         other    0.3130770966143612
1         other    0.5167434068096931
2           ebt    0.7606898392115471
3           ebt    0.9424572692382547
4           ebt     0.624282017575857
5           ebt    0.8584841824784487
6         other    0.5017083765654611
7         other  0.025994123211164233
8           ebt   0.07045354449612984
9           ebt   0.11976351556850084

Let's convert this to a Dask DataFrame

In [2]: data = dd.from_pandas(df, npartitions=2)

and use apply (on a Series) to assign:

In [3]:
data['ebt_dummy'] = data.paymenttypeid.apply(
    lambda x: 1 if x == 'ebt' else 0, meta=('ebt_dummy', 'int64'))
data.compute()

Out [3]:
  paymenttypeid                 other  ebt_dummy
0         other    0.3130770966143612          0
1         other    0.5167434068096931          0
2           ebt    0.7606898392115471          1
3           ebt    0.9424572692382547          1
4           ebt     0.624282017575857          1
5           ebt    0.8584841824784487          1
6         other    0.5017083765654611          0
7         other  0.025994123211164233          0
8           ebt   0.07045354449612984          1
9           ebt   0.11976351556850084          1

Update:

It seems that the meta you pass is the problem, since this works:

data = data.map_partitions(lambda df: df.assign(
    ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0)))

data.compute()

In my example, if I wanted to specify the meta, I would have to pass the dtypes of the current data, not the dtypes I expect after the assignment:

data.map_partitions(lambda df: df.assign(
    ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0)),
    meta={'paymenttypeid': 'str', 'other': 'float64'})
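Note that the Dask documentation describes meta as an empty object matching the output of the mapped function, so depending on your Dask version you may be able to pass the full output schema, new column included (a sketch under that assumption):

data.map_partitions(
    lambda df: df.assign(ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0)),
    meta={'paymenttypeid': 'str', 'other': 'float64', 'ebt_dummy': 'i8'})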

Upvotes: 1
