Reputation: 13
I'm trying to apply a function that returns a dataframe to every column, without looping over the columns. Is this possible in Polars? Thanks in advance.
Here's a simple example:
import polars as pl
import numpy as np
df = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})
def get_rolling_mean(series, windows=[1,2,3]):
    rolling_means = [series.rolling_mean(window).alias(f'{window}') for window in windows]
    return pl.DataFrame(rolling_means)
## Line I'm trying to change to avoid looping since my dataframe is large.
df_rolling_means = [get_rolling_mean(df[col]) for col in df.columns]
The line I'm trying to optimize (if possible) is the final line. Thanks again.
Edit: Many thanks to @braaannigan for the answer. I have one more question related to this issue. Say that I have two dataframes and a function that generates a signal:
import polars as pl
import numpy as np
df_high = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})
df_low = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})
def get_signal(high_series, low_series, params):
    "do some calculation (output will also be a polars series)"
    return signal
PARAMS = ...
# Currently what I'm doing is:
signals = [get_signal(df_high[col], df_low[col], PARAMS) for col in df_high.columns]
I'm not sure how to avoid looping here, since I have two dataframes rather than one. Should I just use apply/map? I think that would still require looping, which I'm trying to avoid if possible. Thanks in advance again.
Upvotes: 1
Views: 302
Reputation: 874
Thanks for a clear example!
We can apply an expression to multiple columns in a few ways. For example, we can apply it to all columns with pl.all().
We can then add a suffix to the output column names with .name.suffix:
window = 1
df.select(pl.all().rolling_mean(window_size = window).name.suffix(f"_{window}"))
For the full solution, we define a list of window sizes and loop over it inside the select call. The Python loop only builds one expression per window; Polars then evaluates all of the expressions, across columns and windows, in parallel.
windows = [1,2,3]
df.select(pl.all().rolling_mean(window_size = window).name.suffix(f"_{window}") for window in windows)
shape: (20, 6)
┌─────┬──────┬──────┬──────┬──────────┬───────────┐
│ A_1 ┆ B_1  ┆ A_2  ┆ B_2  ┆ A_3      ┆ B_3       │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---      ┆ ---       │
│ f64 ┆ f64  ┆ f64  ┆ f64  ┆ f64      ┆ f64       │
╞═════╪══════╪══════╪══════╪══════════╪═══════════╡
│ 0.0 ┆ 22.0 ┆ null ┆ null ┆ null     ┆ null      │
│ 6.0 ┆ 29.0 ┆ 3.0  ┆ 25.5 ┆ null     ┆ null      │
│ 0.0 ┆ 24.0 ┆ 3.0  ┆ 26.5 ┆ 2.0      ┆ 25.0      │
│ 0.0 ┆ 29.0 ┆ 0.0  ┆ 26.5 ┆ 2.0      ┆ 27.333333 │
│ 8.0 ┆ 20.0 ┆ 4.0  ┆ 24.5 ┆ 2.666667 ┆ 24.333333 │
│ …   ┆ …    ┆ …    ┆ …    ┆ …        ┆ …         │
│ 9.0 ┆ 24.0 ┆ 5.5  ┆ 22.0 ┆ 6.0      ┆ 23.666667 │
│ 6.0 ┆ 24.0 ┆ 7.5  ┆ 24.0 ┆ 5.666667 ┆ 22.666667 │
│ 3.0 ┆ 20.0 ┆ 4.5  ┆ 22.0 ┆ 6.0      ┆ 22.666667 │
│ 2.0 ┆ 25.0 ┆ 2.5  ┆ 22.5 ┆ 3.666667 ┆ 23.0      │
│ 6.0 ┆ 26.0 ┆ 4.0  ┆ 25.5 ┆ 3.666667 ┆ 23.666667 │
└─────┴──────┴──────┴──────┴──────────┴───────────┘
Upvotes: 4