J Griffiths
J Griffiths

Reputation: 51

Why is Polars streaming engine so slow?

I have a directory of parquet files and am looking to apply a function to all of them and take an average.

At first I thought that Polars would excel at this with LazyFrame and streaming functionality, but it seems a naive approach of looping through chunks of data and loading iteratively outperforms. Is there something wrong with my implementation that is causing this?

%%time
unique_years = (
    pl.scan_parquet(data_glob)
    .select(pl.col("time").dt.year().unique())
    .collect()
    .to_series()
    .to_list()
)

results = []
for year in tqdm(unique_years):
    results.append(
        pl.scan_parquet(data_glob)
        .filter(pl.col("time").dt.year()==year)
        .select(["time", "market", "close"])
        .sort("time")
        .with_columns(pl.col("close").log().diff().over("market").alias("log_returns"))
        .group_by("time")
        .agg(pl.col("log_returns").mean())
        .collect()
    )
CPU times: user 10min 10s, sys: 4min 44s, total: 14min 55s
Wall time: 2min 4s
%%time
df = (
    pl.scan_parquet(data_glob)
    .select(["time", "market", "close"])
    .sort("time")
    .with_columns(pl.col("close").log().diff().over("market").alias("log_returns"))
    .group_by("time")
    .agg(pl.col("log_returns").mean())
    .collect(streaming=True)
)

The first approach, which I assumed to be incorrect, finishes relatively quickly, whereas the second does not execute within any reasonable amount of time. It seems odd that the streaming engine is so inefficient that simple looping can outperform it.

Upvotes: 0

Views: 1748

Answers (1)

Adrian Fletcher
Adrian Fletcher

Reputation: 170

The sort, diff, over, and mean functions are not supported by the streaming engine. Like @ritchie46 mentions, the window functions are the expensive functions here.

This article tests most of the functions in Polars and draws some conclusions about what the streaming engine can and cannot do. Note: these are done on a pair of floating point columns. Either way, if you look at the plan of your query, you will see which parts are not getting included in the streaming engine and go from there. You can use the explain(streaming=True) command to see what is run outside the --- STREAMING tags.

Upvotes: 4

Related Questions