TheRealBenbo
TheRealBenbo

Reputation: 153

How to Write Poisson CDF as Python Polars Expression

I have a collection of polars expressions being used to generate features for an ML model. I'd like to add a poission cdf feature to this collection whilst maintaining lazy execution (with benefits of speed, caching etc...). I so far have not found an easy way of achieving this.

I've been able to get the result I'd like outside of the desired lazy expression framework with:

import polars as pl
from scipy.stats import poisson

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poisson_cdf"))

However, in reality I'd like this to look like:

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
    [
        ... # bunch of other expressions here
        poisson_cdf()
    ]
)

where poisson_cdf is some polars expression like:

def poisson_cdf():
    # this is just for illustration, clearly wont work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")

I also tried using a struct made up of "count" and "expected_count" and apply like advised in the docs when applying custom functions. However, my dataset is several millions of rows in reality - leading to absurd execution time.

Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!

Upvotes: 3

Views: 737

Answers (3)

Dean MacGregor
Dean MacGregor

Reputation: 18626

You want to take advantage of the fact that scipy has a set of functions which are numpy ufuncs as those

still have fast columnar operation through the NumPy API.

Specifically you want the pdtr function.

You then want to use reduce rather than map or apply as those are for generic python functions and aren't going to perform as well.

So if we have...

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))

then we can add to it with

df=df.with_columns(
    pl.reduce(function=pdtr, exprs=[pl.col('count'),pl.col('expected_count')]).alias("poicdf")
)
df

shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf   │
│ ---   ┆ ---            ┆ ---          ┆ ---      │
│ i64   ┆ f64            ┆ f64          ┆ f64      │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9     ┆ 7.7            ┆ 0.75308      ┆ 0.75308  │
│ 2     ┆ 0.2            ┆ 0.998852     ┆ 0.998852 │
│ 3     ┆ 0.7            ┆ 0.994247     ┆ 0.994247 │
│ 4     ┆ 1.1            ┆ 0.994565     ┆ 0.994565 │
│ 5     ┆ 7.5            ┆ 0.241436     ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘

You can see it gives the same answer.

Upvotes: 1

jqurious
jqurious

Reputation: 21464

It sounds like you want to use .map_batches()

df.with_columns(
   pl.struct("count", "expected_count")
     .map_batches(lambda x: 
        poisson.cdf(x.struct.field("count"), x.struct.field("expected_count"))
     )
     .alias("poisson_cdf")
)
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ ---   | ---            | ---         │
│ i64   | f64            | f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     | 7.7            | 0.75308     │
│ 2     | 0.2            | 0.998852    │
│ 3     | 0.7            | 0.994247    │
│ 4     | 1.1            | 0.994565    │
│ 5     | 7.5            | 0.241436    │
└───────┴────────────────┴─────────────┘

Upvotes: 1

Alexander Belopolsky
Alexander Belopolsky

Reputation: 2268

If scipy.stats.poisson.cdf was implemented as a proper numpy universal function, it would be possible to use it directly on polars expressions, but it is not. Fortunately, Poisson CDF is almost the same as regularized upper incomplete gamma function for which scipy supplies gammaincc which can be used in polars expressions:

>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k   ┆ cdf      │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.018316 │
│ 1   ┆ 0.091578 │
│ 2   ┆ 0.238103 │
│ 3   ┆ 0.43347  │
│ ... ┆ ...      │
│ 6   ┆ 0.889326 │
│ 7   ┆ 0.948866 │
│ 8   ┆ 0.978637 │
│ 9   ┆ 0.991868 │
└─────┴──────────┘

The result is the same as returned by poisson.cdf:

>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k   ┆ cdf      ┆ cdf2     │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.018316 ┆ 0.018316 │
│ 1   ┆ 0.091578 ┆ 0.091578 │
│ 2   ┆ 0.238103 ┆ 0.238103 │
│ 3   ┆ 0.43347  ┆ 0.43347  │
│ ... ┆ ...      ┆ ...      │
│ 6   ┆ 0.889326 ┆ 0.889326 │
│ 7   ┆ 0.948866 ┆ 0.948866 │
│ 8   ┆ 0.978637 ┆ 0.978637 │
│ 9   ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘

Upvotes: 3

Related Questions