mindoverflow

Reputation: 924

Python (Polars): Vectorized operation of determining current solution with the use of previous variables

Let's say we have 3 variables a, b & c.

There are n instances of each, and all but the first instance of c are null.

We are to calculate each next c from a given formula whose right-hand side uses only the current row's a and b and the current value of c:

c = [(1 + a) * (current_c) * (b)] + [(1 + b) * (current_c) * (a)]
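
To make the recurrence concrete, here is a minimal plain-Python reference loop, i.e. the kind of looping the question wants to avoid. The function name reference_c is hypothetical; the inputs are the a and b values and the starting c of 3 from the code for reference further down.

```python
def reference_c(a, b, initial_c):
    """Apply c = (1 + a) * c * b + (1 + b) * c * a row by row."""
    out = []
    current_c = initial_c
    for a_i, b_i in zip(a, b):
        # Each new c depends on the c produced by the previous row.
        current_c = (1 + a_i) * current_c * b_i + (1 + b_i) * current_c * a_i
        out.append(current_c)
    return out


result = reference_c([2, 3, 4, 5, 8], [3, 7, 4, 9, 2], initial_c=3)
print(result)  # [51, 2652, 106080, 11032320, 463357440]
```

Any vectorized approach should reproduce these five values.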

How do we go about this calculation without using native Python looping? I've tried a when/then/otherwise expression with shift(1) (see the code for reference below), to no avail. The shift always seems to be applied to the whole column at once.

I thought perhaps the most plausible way to do this would be either rolling by 1 with grouping by 2, or via pl.int_range(...) and using the current index column number as the shift value. However, these keep failing because I can't come up with the correct syntax: I'm unable to pass the index column value and have Polars accept it as a number. Even casting throws the same errors.

Right now I am thinking we could manage another column for shifting and passing values back to column c, but then again, I'm not sure this would even be an efficient way to go about it...

What would be the most efficient way to go about this without offloading to Rust?

Code for reference:

import polars as pl

if __name__ == "__main__":
    initial_c_value = 3

    df = pl.DataFrame(((2, 3, 4, 5, 8), (3, 7, 4, 9, 2)), schema=('a', 'b'))
    df = df.with_row_index('i', 1).with_columns(pl.lit(None).alias('c'))

    df = df.with_columns(pl.when(pl.col('i') == 1)
    .then(
        (((1 + pl.col('a')) * (initial_c_value) * (pl.col('b'))) +
        ((1 + pl.col('b')) * (initial_c_value) * (pl.col('a')))).alias('c'))
    .otherwise(
        ((1 + pl.col('a')) * (pl.col('c').shift(1)) * (pl.col('b'))) +
        ((1 + pl.col('b')) * (pl.col('c').shift(1)) * (pl.col('a')))).shift(1).alias('c'))

    print(df)

Upvotes: 3

Views: 451

Answers (2)

roman

Reputation: 117540

Unfortunately, there's no reduce operation in Polars that works vertically. You can use cumulative_eval(), but it operates on the whole window, so you'd need to recalculate all the elements every time.

One way of doing it would be to use a recursive common table expression in DuckDB:

import duckdb

duckdb.sql(f"""
with recursive cte(i,a,b,c) as (
    select
        i, a, b,
        (1 + a) * ({initial_c_value} * b) + (1 + b) * ({initial_c_value} * a) as c
    from df
    where i = 1
    
    union all

    select
        df.i, df.a, df.b,
        (1 + df.a) * (cte.c * df.b) + (1 + df.b) * (cte.c * df.a) as c
    from cte
        inner join df on
            df.i = cte.i + 1
)
select * from cte
""").pl()

┌─────┬─────┬─────┬───────────┐
│ i   ┆ a   ┆ b   ┆ c         │
│ --- ┆ --- ┆ --- ┆ ---       │
│ u32 ┆ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 3   ┆ 51        │
│ 2   ┆ 3   ┆ 7   ┆ 2652      │
│ 3   ┆ 4   ┆ 4   ┆ 106080    │
│ 4   ┆ 5   ┆ 9   ┆ 11032320  │
│ 5   ┆ 8   ┆ 2   ┆ 463357440 │
└─────┴─────┴─────┴───────────┘

Another way would be to use Numba:

import numba as nb

@nb.guvectorize([(nb.int64[:], nb.int64[:], nb.int64, nb.int64[:])], '(n),(n),()->(n)', nopython=True)
def calc(a, b, c0, c):
    for i in range(0, len(a)):
        c[i] = (1 + a[i]) * (c0 * b[i]) + (1 + b[i]) * (c0 * a[i])
        c0 = c[i]

df.with_columns(c = calc(pl.col('a'), pl.col('b'), initial_c_value))

┌─────┬─────┬─────┬───────────┐
│ i   ┆ a   ┆ b   ┆ c         │
│ --- ┆ --- ┆ --- ┆ ---       │
│ u32 ┆ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 3   ┆ 51        │
│ 2   ┆ 3   ┆ 7   ┆ 2652      │
│ 3   ┆ 4   ┆ 4   ┆ 106080    │
│ 4   ┆ 5   ┆ 9   ┆ 11032320  │
│ 5   ┆ 8   ┆ 2   ┆ 463357440 │
└─────┴─────┴─────┴───────────┘

Upvotes: 1

Dean MacGregor

Reputation: 18691

Using Numba, you can make ufuncs which Polars can use seamlessly.

from numba import guvectorize, int64
import polars as pl

@guvectorize([(int64[:], int64[:], int64, int64[:])], '(n),(n),()->(n)', nopython=True)
def make_c(a,b,init_c, res):
    res[0]=(1+a[0]) * init_c * b[0] + (1+b[0]) * init_c * a[0]
    for i in range(1,a.shape[0]):
        res[i] = (1+a[i]) * res[i-1] * b[i] + (1+b[i]) * res[i-1] * a[i]
        
df = pl.DataFrame(((2, 3, 4, 5, 8), (3, 7, 4, 9, 2)), schema=('a', 'b'))

df.with_columns(
    c=make_c(pl.col('a'), pl.col('b'), 3)
)

shape: (5, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ c         │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═══════════╡
│ 2   ┆ 3   ┆ 51        │
│ 3   ┆ 7   ┆ 2652      │
│ 4   ┆ 4   ┆ 106080    │
│ 5   ┆ 9   ┆ 11032320  │
│ 8   ┆ 2   ┆ 463357440 │
└─────┴─────┴───────────┘

The way it works is that the ufunc detects that its input is a Polars Expr (i.e. pl.col() is an Expr) and then hands control to Polars. Because of that, you can NOT just call make_c('a', 'b', 3): its input would then be a plain str, and it wouldn't know what to do with that.

Upvotes: 2
