Wazir Kahar

Reputation: 13

How to apply a function that returns a dataframe to all columns without looping through each column in Polars?

I'm trying to apply a function that returns a dataframe to every column without looping. Is this possible in Polars? Thanks in advance.

Here's a simple example:

import polars as pl
import numpy as np

df = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})

def get_rolling_mean(series, windows=(1, 2, 3)):
    # One rolling-mean Series per window size, each named after its window.
    rolling_means = [series.rolling_mean(window).alias(f'{window}') for window in windows]
    return pl.DataFrame(rolling_means)

## Line I'm trying to change to avoid looping since my dataframe is large.
df_rolling_means = [get_rolling_mean(df[col]) for col in df.columns]

The line I'm trying to optimize (if possible) is the final line. Thanks again.

Edit: Many thanks to @braaannigan for the answer. I have one more question related to this issue. Say that I have two dataframes and a function that generates a signal:

import polars as pl
import numpy as np

df_high = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})

df_low = pl.DataFrame({
    'A': np.random.randint(0, 10, 20),
    'B': np.random.randint(20, 30, 20),
})

def get_signal(high_series, low_series, params):
    """Do some calculation (output will also be a polars series)."""
    signal = ...  # the actual calculation is omitted here
    return signal

PARAMS = ...

# Currently what I'm doing is:
signals = [get_signal(df_high[col], df_low[col], PARAMS) for col in df_high.columns]

I'm not sure how to avoid the loop here, since I have two dataframes rather than one. Should I just use apply/map? I think that would still require looping, which I'm trying to avoid if possible. Thanks in advance again.

Upvotes: 1

Views: 302

Answers (1)

braaannigan

Reputation: 874

Thanks for a clear example!

We can apply an expression to multiple columns in a few ways. For example, pl.all() applies it to every column.

We can then add a suffix to the output column names with .name.suffix:

window = 1
df.select(pl.all().rolling_mean(window_size = window).name.suffix(f"_{window}"))

For the full solution we define a list of window sizes and loop over it inside the select call. The Python loop only builds one expression per window; since all of them are passed to a single select, Polars will do the columns and windows in parallel:

windows = [1,2,3]
df.select(pl.all().rolling_mean(window_size = window).name.suffix(f"_{window}") for window in windows)
shape: (20, 6)
┌─────┬──────┬──────┬──────┬──────────┬───────────┐
│ A_1 ┆ B_1  ┆ A_2  ┆ B_2  ┆ A_3      ┆ B_3       │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---      ┆ ---       │
│ f64 ┆ f64  ┆ f64  ┆ f64  ┆ f64      ┆ f64       │
╞═════╪══════╪══════╪══════╪══════════╪═══════════╡
│ 0.0 ┆ 22.0 ┆ null ┆ null ┆ null     ┆ null      │
│ 6.0 ┆ 29.0 ┆ 3.0  ┆ 25.5 ┆ null     ┆ null      │
│ 0.0 ┆ 24.0 ┆ 3.0  ┆ 26.5 ┆ 2.0      ┆ 25.0      │
│ 0.0 ┆ 29.0 ┆ 0.0  ┆ 26.5 ┆ 2.0      ┆ 27.333333 │
│ 8.0 ┆ 20.0 ┆ 4.0  ┆ 24.5 ┆ 2.666667 ┆ 24.333333 │
│ …   ┆ …    ┆ …    ┆ …    ┆ …        ┆ …         │
│ 9.0 ┆ 24.0 ┆ 5.5  ┆ 22.0 ┆ 6.0      ┆ 23.666667 │
│ 6.0 ┆ 24.0 ┆ 7.5  ┆ 24.0 ┆ 5.666667 ┆ 22.666667 │
│ 3.0 ┆ 20.0 ┆ 4.5  ┆ 22.0 ┆ 6.0      ┆ 22.666667 │
│ 2.0 ┆ 25.0 ┆ 2.5  ┆ 22.5 ┆ 3.666667 ┆ 23.0      │
│ 6.0 ┆ 26.0 ┆ 4.0  ┆ 25.5 ┆ 3.666667 ┆ 23.666667 │
└─────┴──────┴──────┴──────┴──────────┴───────────┘

Upvotes: 4
