sognetic

Reputation: 23

Is there a good way to do conditional select on a dask dataframe for many conditions?

I'm switching from Pandas to Dask and want to do conditional select on a dataframe. I'd like to provide a list of conditions, preferably as boolean arrays/series and would then get a dataframe with all these conditions applied.

In Pandas, I just did np.all([BoolSeries1, BoolSeries2, ...], axis=0) and applied the resulting mask to the dataframe.
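For context, a minimal sketch of that Pandas pattern (the column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe with two random columns.
df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})

# np.all with axis=0 reduces the list of boolean Series row-wise,
# producing a single boolean mask of length 1000.
mask = np.all([df['A'] > 0.4, df['B'] < 0.4], axis=0)
filtered = df[mask]
```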

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

df  = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)

cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]
bool_ar = da.all(da.asarray([cut.compute() for cut in cuts]),axis=0).compute()
ddf = ddf.loc[bool_ar.to_dask_dataframe()]['C']

This works but is quite slow because I have to call .compute() twice.

I feel like there must be a better way to solve this; converting first to an array and then back to a dataframe feels really clunky.

Upvotes: 2

Views: 2954

Answers (2)

sognetic

Reputation: 23

Okay, I think I've solved it.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import operator
from functools import reduce

df  = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)

cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]

bool_arr = reduce(operator.and_, cuts)
ddf = ddf.loc[bool_arr]['C']

Using reduce with operator.and_ from the operator module solved my problem. Thanks for the help, everybody!

Upvotes: 0

MRocklin

Reputation: 57261

You don't want to call .compute prematurely. That brings things out of Dask space and back into numpy/pandas, which makes it hard to align things again and is also inefficient. Instead, I think you're looking for the & operator:

import dask.dataframe as dd
import numpy as np
import pandas as pd

df  = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000), 'C': np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=10)

df2 = ddf[(ddf['A'] > 0.4) & (ddf['B'] < 0.4)]

Every time you switch between dask dataframe and dask array, or between dask and numpy/pandas, you introduce more complexity. It's best to stay within one system if you can. Things will be simpler.

You can extend this to an arbitrary number of conditions with a for loop.

conditions = [...]

cond = conditions[0]

for c in conditions[1:]:
    cond = cond & c

Upvotes: 2
