Reputation: 73
I just started using Polars, and I love its lazy-chaining features! But I'm stuck on what I thought would be a simple pattern -- I want to chain several operations in sequence, pulling out some summary statistics after each operation. Here's a trivial example using Pandas:
import pandas as pd

df = pd.read_csv("my.csv")
l1 = len(df)          # row count before filtering
df = df[df.A != 0]
l2 = len(df)          # row count after filtering
print(f"{l1}, {l2}")
However, the dataset might be too big to fit in RAM, so I want to use a streaming LazyFrame instead of a DataFrame. What I find myself wanting is some way to express a "branched" LazyFrame with multiple ".collect()" calls that all get evaluated at once.
I can see two ways that don't quite work. You could express this with two different collection operations, but this solution requires reading the CSV twice:
import polars as pl

df = pl.scan_csv("my.csv")
l1 = df.select(pl.len()).collect().item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).collect().item()  # <- reads the file a second time, not efficient
print(f"{l1}, {l2}")
Alternatively, you could "cache" the streamed dataframe, but this seems even sillier because the dataframe then has to sit in memory and you lose the benefit of streaming:
df = pl.scan_csv("my.csv").collect() # <- basically just not streaming
l1 = df.select(pl.len()).item()
l2 = df.filter(pl.col("A") != 0).select(pl.len()).item()
print(f"{l1}, {l2}")
Is there any way to collect both counts in streaming mode, without reading the CSV multiple times? And is there a general way to do "branched" operations like this? (Some of the things I want to do with intermediates are considerably more complex than just counting rows, so while a len()-specific answer would still be helpful, I'm really looking for a general solution.)
Upvotes: 2
Views: 196
Reputation: 18331
In the general, idiomatic sense: no, there's (mostly) no way to do what you want the way you're doing it with pandas.
In the specific case of your example, you could do:
l1, l2 = df.select(
    pl.len(),
    (pl.col("A") != 0).sum(),
).collect().rows()[0]
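This works because summing a boolean expression counts its True values, so (pl.col("A") != 0).sum() is exactly the post-filter row count. And since it's all one lazy query, you can also ask for the streaming engine when you collect; a minimal variant of the above, assuming a Polars version that accepts the streaming flag:

l1, l2 = df.select(
    pl.len(),
    (pl.col("A") != 0).sum(),
).collect(streaming=True).rows()[0]  # newer Polars spells this .collect(engine="streaming")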
Other than that, you've got window functions, which essentially let you run multiple group_bys in the same operation.
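As a minimal sketch of the window-function idea (the grouping column "B" here is hypothetical, not from your data): .over() broadcasts a per-group aggregate back onto every row, so several groupings can share a single pass instead of needing separate group_by calls.

import polars as pl

df = pl.scan_csv("my.csv")
out = df.with_columns(
    pl.col("A").sum().over("B").alias("A_sum_by_B"),    # per-group sum, broadcast to each row in the group
    pl.col("A").mean().over("B").alias("A_mean_by_B"),  # a second aggregation in the same operation
).collect()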
You can, of course, do explicit joins before collecting so that you only collect once; a sketch of that follows.
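Here's a rough sketch of the join approach, assuming each branch reduces to a single-row summary; a cross join lines the two summaries up side by side so one collect() produces both (whether the optimizer actually shares the underlying scan depends on your Polars version):

import polars as pl

df = pl.scan_csv("my.csv")
total = df.select(pl.len().alias("l1"))
nonzero = df.filter(pl.col("A") != 0).select(pl.len().alias("l2"))

l1, l2 = total.join(nonzero, how="cross").collect().row(0)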
There's also collect_all, which lets you collect multiple lazy frames at once, in parallel. That would look like:
res = pl.collect_all([
    df.select(pl.len()),
    df.filter(pl.col("A") != 0).select(pl.len()),
])
l1, l2 = res[0].item(), res[1].item()
This would still read the file twice; it would just do it in parallel. So it isn't quite what you want either, but it's another tool that gets you closer.
Upvotes: 2