Reputation: 31
I am currently reading parquet files from a delta lake with pl.scan_delta(...)
and want to filter the LazyFrame on lower case values of a string column.
The code below loads everything into memory and kills my kernel.
lazy_df = (
    pl.scan_delta(...)
    .filter(pl.col("partition_col") == "A")
    .filter(pl.col("parname").str.to_lowercase().is_in(['a', 'b', 'c']))
    .select(*columns)
)
lazy_df.collect()
But without the str expression, everything will fit into memory. It seems like it needs to read the whole column into memory to be able to run lowercase on it? Is that true?
lazy_df = (
    pl.scan_delta(...)
    .filter(pl.col("partition_col") == "A")
    .filter(pl.col("parname").is_in(['A', 'a', 'b', 'C']))
    .select(*columns)
)
lazy_df.collect()
When running print(lazy_df.explain()) on the frames I get the following.

With the str.to_lowercase() expression:

FILTER [(col("parname").str.lowercase().is_in([Series])) & ([(col("partition_col")) == (Utf8(A))])] FROM
PYTHON SCAN
  PROJECT */16 COLUMNS

Without it:

PYTHON SCAN
  PROJECT */16 COLUMNS
  SELECTION: ((pa.compute.field('parname')).isin(["A","a","b","C"]) & (pa.compute.field('partition_col') == 'A'))
Also, how can I filter this frame on lower-case values of a column in a memory-efficient way?
Upvotes: 1
Views: 855
Reputation: 18691
I want to address a potential misconception. You say
But without the str expression, everything will fit into memory. It seems like it needs to read the whole column into memory to be able to run lowercase on it? Is that true?
When you collect, everything is put in memory. The issue isn't the size of the column by itself; it's the overhead of the operation. Polars' and Arrow's memory layout is really optimized for numbers; it doesn't handle string manipulation as well.
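As a rough illustration (a toy sketch, not your delta table), lower-casing a string column materializes a whole second copy of the data on top of the original column:
import polars as pl

# Toy example: the column itself is modest in size, but the lowercase
# operation allocates a second, similarly sized string buffer.
s = pl.Series("parname", ["Alpha", "Beta", "Gamma"] * 1_000_000)
print(s.estimated_size("mb"))                     # size of the original column
print(s.str.to_lowercase().estimated_size("mb"))  # a fresh buffer of about the same size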
Main point
In addition to the inefficiency of string methods, you're also hamstrung by the fact that polars doesn't read delta natively; it goes through the deltalake library, and what you get from that library is a scan_pyarrow_dataset. You can see this either by looking at the source code or by noting the "PYTHON SCAN" in the explain output. Unfortunately, the streaming engine doesn't work with pyarrow datasets. What you can do instead is use pyarrow directly with a scanner and batches to ease the memory requirements.
import deltalake
import pyarrow.compute as pc
import pyarrow.dataset as ds
import polars as pl

### This is basically a copy-paste from the scan_delta source
dl_ds = deltalake.DeltaTable(
    ## Parameters for you to set or remove
    table_path,
    version=version,
    storage_options=storage_options,
    **delta_table_options,
).to_pyarrow_dataset(**pyarrow_options)

## This is the scanner/filter generator
batch_generator = dl_ds.scanner(
    filter=(
        pc.ascii_lower(ds.field("parname")).isin(["a", "b", "c"])
        & (ds.field("partition_col") == "A")
    ),
    columns=columns,
    batch_size=500,  # increase batch_size for speed, reduce it to use less memory
).to_batches()

## Convert each record batch to polars and concatenate them at the end
frames = [pl.from_arrow(batch).lazy() for batch in batch_generator]
lazy_df = pl.concat(frames)
lazy_df.collect()
Note that batch_size=500 is probably way too small. pyarrow's default is 128Ki, and it is measured in rows.
I put columns=columns in the scanner. That assumes you just have a list of strings; if your columns are polars expressions then it won't work. There's a reasonable chance you can replicate your polars expressions with pyarrow ones. Here's the doc page for Scanner arguments.
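For instance (a sketch reusing your column names; ascii_lower assumes ASCII-only data, utf8_lower is safer otherwise), the two polars filters from your question translate to pyarrow expressions like this:
import pyarrow.compute as pc
import pyarrow.dataset as ds

# polars: pl.col("parname").str.to_lowercase().is_in(["a", "b", "c"])
parname_filter = pc.ascii_lower(ds.field("parname")).isin(["a", "b", "c"])

# polars: pl.col("partition_col") == "A"
partition_filter = ds.field("partition_col") == "A"

# Combined expression, usable as the scanner's filter= argument
combined_filter = parname_filter & partition_filter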
Upvotes: 2