Reputation: 73
I just started using Polars, and I love its lazy-chaining features! But I'm stuck on what I thought would be a simple pattern -- I want to chain several operations in sequence, pulling out some summary statistics after each operation. Here's a trivial example using Pandas:
import pandas as pd

df = pd.read_csv("my.csv")
l1 = len(df)          # row count before filtering
df = df[df.A != 0]
l2 = len(df)          # row count after filtering
print(f"{l1}, {l2}")
However, the dataset might be too big to fit in RAM, so I want to use a streaming LazyFrame instead of a DataFrame. What I find myself wanting is some way to express a "branched" LazyFrame with multiple ".collect()" calls that all get evaluated at once.
I can see two ways that don't quite work. You could express this with two different collection operations, but this solution requires reading the CSV twice:
import polars as pl

df = pl.scan_csv("my.csv")
l1 = df.select(pl.len()).collect().item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).collect().item()  # <- reads the file a second time, not efficient
print(f"{l1}, {l2}")
Alternatively, you could "cache" the streamed dataframe, but this seems even sillier because the dataframe then has to sit in memory and you lose the benefit of streaming:
df = pl.scan_csv("my.csv").collect() # <- basically just not streaming
l1 = df.select(pl.len()).item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).item()
print(f"{l1}, {l2}")
Is there any way to collect both counts in streaming mode, without reading the CSV multiple times? And is there a general way to do "branched" operations like this? (Some of the things I want to do with intermediates are considerably more complex than just counting rows, so while a len()-specific answer would still be helpful, I'm really looking for a general solution.)
Upvotes: 2
Views: 196
Reputation: 18331
In the general, idiomatic sense: no, there's (mostly) no way to do what you want the way you're doing it with pandas.
In the specific case of your example, you could do:
l1, l2 = df.select(
    pl.len(),
    (pl.col("A") != 0).sum(),
).collect().rows()[0]
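This works because summing a boolean expression counts its True values, so (pl.col("A") != 0).sum() is exactly the post-filter row count. And since it's all one lazy query, you can also ask for the streaming engine when you collect; a minimal variant of the above, assuming a Polars version that accepts the streaming flag:

l1, l2 = df.select(
    pl.len(),
    (pl.col("A") != 0).sum(),
).collect(streaming=True).rows()[0]  # newer Polars spells this .collect(engine="streaming")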
Other than that, you've got window functions, which essentially let you run multiple group_bys in the same operation.
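As a minimal sketch of the window-function idea (the grouping column "B" here is hypothetical, not from your data): .over() broadcasts a per-group aggregate back onto every row, so several groupings can share a single pass instead of needing separate group_by calls.

import polars as pl

df = pl.scan_csv("my.csv")
out = df.with_columns(
    pl.col("A").sum().over("B").alias("A_sum_by_B"),    # per-group sum, broadcast to each row in the group
    pl.col("A").mean().over("B").alias("A_mean_by_B"),  # a second aggregation in the same operation
).collect()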
You can, of course, do explicit joins before collecting so that you only collect once; a sketch of that follows.
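Here's a rough sketch of the join approach, assuming each branch reduces to a single-row summary; a cross join lines the two summaries up side by side so one collect() produces both (whether the optimizer actually shares the underlying scan depends on your Polars version):

import polars as pl

df = pl.scan_csv("my.csv")
total = df.select(pl.len().alias("l1"))
nonzero = df.filter(pl.col("A") != 0).select(pl.len().alias("l2"))

l1, l2 = total.join(nonzero, how="cross").collect().row(0)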
There's also collect_all, which lets you collect multiple lazy frames at once, in parallel. That would look like:
res = pl.collect_all([
    df.select(pl.len()),
    df.filter(pl.col("A") != 0).select(pl.len()),
])
l1, l2 = res[0].item(), res[1].item()
This would still read the file twice; it would just do it in parallel. So it isn't quite what you want either, but it's another tool that gets you closer.
Upvotes: 2