Reputation: 1085
I want to fix my function in order that the expressions inside filter
will vary based on the arguments passed. For example if we pass eq_cols = {"a": "aaa", "b": "bbb"}
, then two expressions should be created and if not (the default), we don't pass those expressions. Same for isin_col
and select_cols
.
import polars as pl
from typing import Dict, List
def query_parquet(
file_path: str,
eq_cols: Dict[str, str] = None,
isin_cols: Dict[str, List[str]] = None,
select_cols: List[str] = None,
) -> pl.DataFrame:
query = (
pl.scan_parquet(file_path)
.filter(
pl.col("A") == "aaa",
# pl.col("B") == "bbb",
# pl.col("C").is_in(["ccc1", "ccc2"])
# pl.col("D").is_in(["ddd1", "ddd2"])
)
.select(["F", "G", "H"])
.collect()
)
return query
Can you guys show me how to do it ? I hope I explained my question good enough.
Upvotes: 2
Views: 903
Reputation: 2893
Expressions are Python objects and not evaluated when creating them, so you can do comprehensions like
pl.scan_parquet(file_path)
.filter(
pl.all_horizontal(
pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items()),
pl.all_horizontal(pl.col(k).is_in(v) for k,v in isin_cols.items()),
)
)
.select(select_cols)
.collect()
And so on in any combining logic you wish, not just all_horizontal
necessarily. In the case of default None
, you could make the default some type of always True
expression or *
for select, or do some if conditionals when building the query over multiple lines instead of just a one-liner like so:
query = pl.scan_parquet(file_path)
if eq_cols or isin_cols:
filter_exprs = []
if eq_cols:
filter_exprs.append(pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items())
if isin_cols:
filter_exprs.append(pl.all_horizontal(pl.col(k).is_in(v) for k,v in isin_cols.items())
query = query.filter(pl.all_horizontal(filter_exprs))
query = query.filter(filter_expr)
if select_cols:
query = query.select(select_cols)
query = query.collect() # done here
Addendum: One further option for eq_cols
is that in recent version of polars, filter
has become more ergonomic. The logic of pl.all_horizontal
is the default when passing in variadic positional arguments (you can even pass in a iterable straight up as one positional argument), and specific keyword arguments are equal to pl.col(k).eq(v)
. So you can unpack a dictionary (and optionally a list/generator, not needed though) to simplify the if statement above like so, given that the defaults are []
and {}
instead of None
s:
if eq_cols or isin_cols:
query = query.filter((pl.col(k).is_in(v) for k,v in isin_cols.items()), **eq_cols)
# can unpack the generator, e.g. *(pl.col(k) ... ), same results
Upvotes: 3