Chuck Daniels
Chuck Daniels

Reputation: 324

Can I use a pandas-like string expression for filtering a DataFrame?

I am considering replacing my use of pandas with polars in a tool that allows users to input predicate expressions for filtering/subsetting data rows. This allows users to use expressions that the pandas.DataFrame.query method can parse, such as "x > 1", as a very simple example.

However, I can't seem to find a way to use the same types of string expressions with polars.DataFrame.filter so that I can swap out pandas for polars without requiring users to change their predicate expressions.

The only thing I've found that's close to my question is this posting: String as a condition in a filter

Unfortunately, that's not quite what I need, as it still requires a string expression like "pl.col('x') > 1" rather than simply "x > 1".

Is there a way to use the simpler ("agnostic") syntax with polars?

Using the example from the polars.DataFrame.filter docs:

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )

When calling df.filter, I'm forced to use expressions like the following:

pl.col("foo") < 3
(pl.col("foo") < 3) & (pl.col("ham") == "a")

However, I want to be able to use the following string expressions instead, respectively, so that the users of the tool (currently using pandas) do not have to be aware of the polars-specific syntax (thus allowing me to swap libraries without impacting users):

"foo < 3"
"foo < 3 & ham == 'a'"

When I attempt to do so, here's what happens, which is puzzling since str is one of the supported types for the predicate argument, so it is unclear as to the syntax supported for str predicates since the docs do not show any examples of such:

>>> df.filter("foo < 3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 2565, in filter
    self.lazy()
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/utils.py", line 391, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1165, in collect
    return pli.wrap_df(ldf.collect())
exceptions.NotFoundError: foo < 3

What I was expecting was the same return value that df.filter(pl.col("foo") < 3) would return.

Upvotes: 3

Views: 977

Answers (1)

ritchie46
ritchie46

Reputation: 14630

You could try to use the pl.sql for that.

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)

string_expr = "foo < 3 and ham = 'a'"

pl.sql(f"""
SELECT * FROM df
WHERE {string_expr}
""").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Note that the SQL language doesn't use the bitwise & nor equality == the same way as pandas, so you might need to replace & with and and == with =.

Upvotes: 4

Related Questions