sclamons

Reputation: 73

Does Polars have an idiomatic way to extract information from the middle of a lazy chain of expressions?

I just started using Polars, and I love its lazy-chaining features! But I'm stuck on what I thought would be a simple pattern -- I want to chain several operations in sequence, pulling out some summary statistics after each operation. Here's a trivial example using Pandas:

import pandas as pd

df = pd.read_csv("my.csv")
l1 = len(df)
df = df[df.A != 0]
l2 = len(df)
print(f"{l1}, {l2}")

However, the dataset might be too big to fit in RAM, so I want to use a streaming LazyFrame instead of a DataFrame. What I find myself wanting to do is to express some kind of "branched" LazyFrame with multiple ".collect()" calls that all get evaluated at once.

I can see two ways that don't quite work. You could express this with two different collection operations, but this solution requires reading the CSV twice:

import polars as pl

df = pl.scan_csv("my.csv")
l1 = df.select(pl.len()).collect().item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).collect().item() # <- Reading a second time, not efficient.
print(f"{l1}, {l2}")

Alternatively, you could "cache" the streamed dataframe, but this seems even sillier because the dataframe then has to sit in memory and you lose the benefit of streaming:

df = pl.scan_csv("my.csv").collect() # <- basically just not streaming
l1 = df.select(pl.len()).item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).item()
print(f"{l1}, {l2}")

Is there any way to collect both counts in streaming mode, without reading multiple times? And is there a general way to do "branched" operations like this? (Some of the things I want to do with intermediates are considerably more complex than just counting rows, so while a len()-specific answer would still be helpful, I'm really looking for a general solution.)

Upvotes: 2

Views: 196

Answers (1)

Dean MacGregor

Reputation: 18331

In the general, idiomatic sense: no, there's (mostly) not a way to do what you want the way you would in pandas.

In the specific sense of your example, you could do

l1, l2 = df.select(
    pl.len(),
    (pl.col('A')!=0).sum()
).collect().rows()[0]

Other than that, you have window functions, which essentially let you run multiple group_bys in the same operation. You can, of course, do explicit joins before collecting so that you only collect once. There's also collect_all, which lets you collect multiple lazy frames at once, in parallel:

res = pl.collect_all([
    df.select(pl.len()),
    df.filter(pl.col("A") != 0).select(pl.len()),
])
l1, l2 = res[0].item(), res[1].item()

This would still read the file twice -- it would just do it in parallel -- so it isn't quite what you want, but it's another tool that gets you closer.

Upvotes: 2
