VERBOSE
VERBOSE

Reputation: 1085

How to filter a lazy polars dataframe with arbitrary expressions?

I want to fix my function in order that the expressions inside filter will vary based on the arguments passed. For example if we pass eq_cols = {"a": "aaa", "b": "bbb"}, then two expressions should be created and if not (the default), we don't pass those expressions. Same for isin_col and select_cols.

import polars as pl
from typing import Dict, List

def query_parquet(
        file_path: str,
        eq_cols: Dict[str, str] = None,
        isin_cols: Dict[str, List[str]] = None,
        select_cols: List[str] = None,
    ) -> pl.DataFrame:

    query = (
        pl.scan_parquet(file_path)
        .filter(
            pl.col("A") == "aaa",
            # pl.col("B") == "bbb",
            # pl.col("C").is_in(["ccc1", "ccc2"])
            # pl.col("D").is_in(["ddd1", "ddd2"])
        )
        .select(["F", "G", "H"])
        .collect()
    )

    return query

Can you guys show me how to do it ? I hope I explained my question good enough.

Upvotes: 2

Views: 903

Answers (1)

Wayoshi
Wayoshi

Reputation: 2893

Expressions are Python objects and not evaluated when creating them, so you can do comprehensions like

pl.scan_parquet(file_path)
.filter(
  pl.all_horizontal(
    pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items()),
    pl.all_horizontal(pl.col(k).is_in(v) for k,v in isin_cols.items()),
  )
)
.select(select_cols)
.collect()

And so on in any combining logic you wish, not just all_horizontal necessarily. In the case of default None, you could make the default some type of always True expression or * for select, or do some if conditionals when building the query over multiple lines instead of just a one-liner like so:

query = pl.scan_parquet(file_path)

if eq_cols or isin_cols:
  filter_exprs = []
  if eq_cols:
    filter_exprs.append(pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items())
  if isin_cols:
    filter_exprs.append(pl.all_horizontal(pl.col(k).is_in(v) for k,v in isin_cols.items())
  query = query.filter(pl.all_horizontal(filter_exprs))

query = query.filter(filter_expr)
if select_cols:
  query = query.select(select_cols)

query = query.collect() # done here

Addendum: One further option for eq_cols is that in recent version of polars, filter has become more ergonomic. The logic of pl.all_horizontal is the default when passing in variadic positional arguments (you can even pass in a iterable straight up as one positional argument), and specific keyword arguments are equal to pl.col(k).eq(v). So you can unpack a dictionary (and optionally a list/generator, not needed though) to simplify the if statement above like so, given that the defaults are [] and {} instead of Nones:

if eq_cols or isin_cols:
  query = query.filter((pl.col(k).is_in(v) for k,v in isin_cols.items()), **eq_cols)
  # can unpack the generator, e.g. *(pl.col(k) ... ), same results

Upvotes: 3

Related Questions