Penguindex

Reputation: 11

Apply multiple filters to dask dataframe

I am trying to filter a large dask dataframe, e.g. dd_test, with multiple criteria at once.

I have created multiple filters for the dask dataframe, e.g.:

filter1 = ~dd_test['size']<0
filter2 = ~dd_test['obs']<100
filter3 = dd_test['freq']>1
...
filter33 = ~ dd_test['price']<0

To apply the filters to the dask dataframe, I run the following code (which works well):

sub_selection = dd_test[filter1][filter2][filter3]...[filter33].compute()

However, some filters are occasionally commented out because they are not needed, and it is very annoying to rewrite the sub_selection line every time, e.g. to

sub_selection = dd_test[filter1][filter3][filter9]...[filter27].compute()

So the idea was to collect all "active" filters in a list before filtering, like this (which works well):

selection_mask = list(value for name, value in sorted(locals().items(), key=lambda item: item[0]) if name.startswith('filter'))

However, I am not able to apply the list of 22 filters to the dask dataframe. I have tried several methods without success, e.g.:

1. sub_selection = dd_test[selection_mask].compute()
2. sub_selection = dd_test[np.logical_and.reduce(selection_mask)].compute()
3. sub_selection = dd_test[pd.Series(np.logical_and.reduce(selection_mask))].compute()

Results:

1. leads to ValueError: Item wrong length 22 instead of 0.
2. leads to ValueError: Item wrong length 279740 instead of 0.
3. returns a result, but the filters are not applied correctly: the output is not as expected.

Does anyone have an idea or solution for how to proceed? Much appreciated.

Upvotes: 1

Views: 1549

Answers (1)

McToel

Reputation: 331

I think what I would do is define the filters using element-wise AND logic, so that they are combined into a single filter:

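# Parenthesize each comparison before negating it: ~ binds more tightly than <,
# so ~dd_test['size'] < 0 would negate the column values, not the comparison.
# The name mask is used instead of filter to avoid shadowing the builtin.
mask = (
    ~(dd_test['size'] < 0)
    & ~(dd_test['obs'] < 100)
    & (dd_test['freq'] > 1)
    ...
    & ~(dd_test['price'] < 0)
    )

sub_selection = dd_test[mask].compute()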

Now you can comment out each filter (except for the first) without any problems.
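If you would rather keep the filters in a list, as in the question, the same element-wise AND can be folded over that list. This is only a minimal sketch: combine_masks is a hypothetical helper name, not something dask provides, and it assumes every entry in the list is a boolean dask (or pandas) Series built from the same dataframe.

```python
from functools import reduce
import operator


def combine_masks(masks):
    """AND together any number of boolean masks.

    Hypothetical helper, not part of dask. It relies only on the `&`
    operator, which dask and pandas Series apply element-wise.
    """
    masks = list(masks)
    if not masks:
        raise ValueError("need at least one mask to combine")
    # Folds the list into ((m1 & m2) & m3) & ...
    # With dask Series this only builds a lazy expression;
    # nothing is computed until .compute() is called.
    return reduce(operator.and_, masks)
```

With the names from the question, this could be used as sub_selection = dd_test[combine_masks(selection_mask)].compute(). Because the result is a single dask Series rather than a Python list or NumPy array, dask can align it with the dataframe's partitions.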

Upvotes: 3
