Reputation: 11
I am trying to filter a large dask dataframe, e.g. dd_test, with multiple criteria at once.
I have created multiple filters for the dask dd: e.g.
filter1 = ~(dd_test['size'] < 0)
filter2 = ~(dd_test['obs'] < 100)
filter3 = dd_test['freq'] > 1
...
filter33 = ~(dd_test['price'] < 0)
To apply the filters to the dask dataframe, I am executing the following code (which works well):
sub_selection = dd_test[filter1][filter2][filter3]...[filter33].compute()
However, some filters are occasionally commented out because they are not needed, and it is very annoying to rewrite sub_selection every time, e.g. to
sub_selection = dd_test[filter1][filter3][filter9]...[filter27].compute()
So the idea was to put all active filters into a list before filtering:
Collect all "active" filters in a list via (which works well):
selection_list = [value for name, value in sorted(locals().items(), key=lambda item: item[0]) if name.startswith('filter')]
However, I am not able to apply the list of 22 filters to the dask dataframe. I have tried several methods without success e.g.:
1. sub_selection = dd_test[selection_list].compute()
2. sub_selection = dd_test[np.logical_and.reduce(selection_list)].compute()
3. sub_selection = dd_test[pd.Series(np.logical_and.reduce(selection_list))].compute()
ValueError: Item wrong length 22 instead of 0.
ValueError: Item wrong length 279740 instead of 0.
Does anyone have an idea or solution for how to proceed? Much appreciated.
Upvotes: 1
Views: 1549
Reputation: 331
I would define the filters using element-wise AND logic, so that they are combined into a single filter:
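# Wrap each comparison in parentheses before negating: `~` binds tighter than `<`.
combined_filter = (
    ~(dd_test['size'] < 0)
    & ~(dd_test['obs'] < 100)
    & (dd_test['freq'] > 1)
    ...
    & ~(dd_test['price'] < 0)
)
sub_selection = dd_test[combined_filter].compute()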
Now you can comment out any individual filter line (except the first) without touching the rest of the expression.
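If you would rather keep the filters in a list, as in the question, the same element-wise AND can be applied across the whole list. The following is only a minimal sketch: `combine_filters` is a hypothetical helper name, not part of dask, and it assumes every item in the list supports the `&` operator (as dask and pandas boolean Series do).

```python
from functools import reduce
import operator


def combine_filters(filters):
    """AND together any number of boolean masks into a single mask.

    Hypothetical helper: works for any objects that support `&`,
    for example dask or pandas boolean Series.
    """
    filters = list(filters)
    if not filters:
        raise ValueError("need at least one filter to combine")
    # reduce applies `a & b` pairwise: ((f1 & f2) & f3) & ...
    return reduce(operator.and_, filters)
```

With the names from the question, this would be used as sub_selection = dd_test[combine_filters(selection_list)].compute(). Because `&` on dask Series is itself lazy, nothing should be computed until .compute() is called.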
Upvotes: 3