sognetic

Reputation: 23

Is there a good way to do conditional select on a dask dataframe for many conditions?

I'm switching from Pandas to Dask and want to do conditional select on a dataframe. I'd like to provide a list of conditions, preferably as boolean arrays/series and would then get a dataframe with all these conditions applied.

In Pandas, I just did np.all([BoolSeries1, BoolSeries2, ...], axis=0) and applied the resulting mask to the dataframe.
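For context, a minimal sketch of that Pandas pattern (the column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe with two random columns.
df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})

# np.all with axis=0 reduces the list of boolean Series row-wise,
# producing a single boolean mask of length 1000.
mask = np.all([df['A'] > 0.4, df['B'] < 0.4], axis=0)
filtered = df[mask]
```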

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

df  = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)

cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]
bool_ar = da.all(da.asarray([cut.compute() for cut in cuts]),axis=0).compute()
ddf = ddf.loc[bool_ar.to_dask_dataframe()]['C']

This works but is quite slow because I have to call .compute() twice.

I feel like there must be a better way to solve this; converting first to an array and then back to a dataframe feels really clunky.

Upvotes: 2

Views: 2954

Answers (2)

sognetic

Reputation: 23

Okay, I think I've solved it.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import operator
from functools import reduce

df  = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)

cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]

bool_arr = reduce(operator.and_, cuts)
ddf = ddf.loc[bool_arr]['C']

Using reduce with operator.and_ from the operator module solved my problem. Thanks for the help, everybody!

Upvotes: 0

MRocklin

Reputation: 57261

You don't want to call .compute prematurely. That brings things out of Dask space and back into numpy/pandas, which makes it hard to align things again and is also inefficient. Instead, I think you're looking for the & operator:

import dask.dataframe as dd
import numpy as np
import pandas as pd

df  = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000), 'C': np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=10)

df2 = ddf[(ddf['A'] > 0.4) & (ddf['B'] < 0.4)]

Every time you switch between dask dataframe and dask array, or between dask and numpy/pandas, you introduce more complexity. It's best to stay within one system if you can. Things will be simpler.

You can extend this to an arbitrary number of conditions with a for loop.

conditions = [...]

cond = conditions[0]

for c in conditions[1:]:
    cond = cond & c

Upvotes: 2
