Reputation: 23
I'm switching from Pandas to Dask and want to do a conditional selection on a dataframe. I'd like to provide a list of conditions, preferably as boolean arrays/series, and then get a dataframe with all of these conditions applied.
In Pandas, I just did np.all([BoolSeries1, BoolSeries2,...]) and applied the result to the dataframe.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)
cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]
bool_ar = da.all(da.asarray([cut.compute() for cut in cuts]), axis=0)
ddf = ddf.loc[bool_ar.to_dask_dataframe()]['C']
This works but is quite slow because I have to call .compute() twice.
I feel like there must be some better way to solve this, converting first to an array and then back to a dataframe feels really clunky.
Upvotes: 2
Views: 2954
Reputation: 23
Okay, I think I've solved it.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import operator
from functools import reduce
df = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)
cuts = [(ddf['A'] > 0.4), (ddf['B'] < 0.4)]
bool_arr = reduce(operator.and_, cuts)
ddf = ddf.loc[bool_arr]['C']
Using reduce with and_ from the operator module solved my problem. Thanks for the help, everybody!
Upvotes: 0
Reputation: 57261
You don't want to call .compute prematurely. That brings things out of Dask space and back into numpy/pandas, which makes it hard to align things again and is also inefficient. Instead, I think you're looking for the & operator:
df = pd.DataFrame({'A' : np.random.rand(1000) , 'B': np.random.rand(1000), 'C' : np.random.rand(1000) })
ddf = dd.from_pandas(df, npartitions=10)
ddf2 = ddf[(ddf['A'] > 0.4) & (ddf['B'] < 0.4)]
Every time you switch between dask dataframe and dask array or dask and numpy/pandas you introduce more complexity. It's best to stay within one system if you can. Things will be simpler.
You can extend this to an arbitrary number of conditions with a for loop.
conditions = [...]
cond = conditions[0]
for c in conditions[1:]:
    cond = cond & c
Upvotes: 2