Reputation: 153
I have a collection of polars expressions being used to generate features for an ML model. I'd like to add a poission cdf feature to this collection whilst maintaining lazy execution (with benefits of speed, caching etc...). I so far have not found an easy way of achieving this.
I've been able to get the result I'd like outside of the desired lazy expression framework with:
import polars as pl
from scipy.stats import poisson
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poisson_cdf"))
However, in reality I'd like this to look like:
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
[
... # bunch of other expressions here
poisson_cdf()
]
)
where poisson_cdf
is some polars expression like:
def poisson_cdf():
# this is just for illustration, clearly wont work
return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")
I also tried using a struct made up of "count"
and "expected_count"
and apply like advised in the docs when applying custom functions. However, my dataset is several millions of rows in reality - leading to absurd execution time.
Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!
Upvotes: 3
Views: 737
Reputation: 18626
You want to take advantage of the fact that scipy has a set of functions which are numpy ufuncs as those
still have fast columnar operation through the NumPy API.
Specifically you want the pdtr function.
You then want to use reduce
rather than map
or apply
as those are for generic python functions and aren't going to perform as well.
So if we have...
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))
then we can add to it with
df=df.with_columns(
pl.reduce(function=pdtr, exprs=[pl.col('count'),pl.col('expected_count')]).alias("poicdf")
)
df
shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9 ┆ 7.7 ┆ 0.75308 ┆ 0.75308 │
│ 2 ┆ 0.2 ┆ 0.998852 ┆ 0.998852 │
│ 3 ┆ 0.7 ┆ 0.994247 ┆ 0.994247 │
│ 4 ┆ 1.1 ┆ 0.994565 ┆ 0.994565 │
│ 5 ┆ 7.5 ┆ 0.241436 ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘
You can see it gives the same answer.
Upvotes: 1
Reputation: 21464
It sounds like you want to use .map_batches()
df.with_columns(
pl.struct("count", "expected_count")
.map_batches(lambda x:
poisson.cdf(x.struct.field("count"), x.struct.field("expected_count"))
)
.alias("poisson_cdf")
)
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ --- | --- | --- │
│ i64 | f64 | f64 │
╞═══════╪════════════════╪═════════════╡
│ 9 | 7.7 | 0.75308 │
│ 2 | 0.2 | 0.998852 │
│ 3 | 0.7 | 0.994247 │
│ 4 | 1.1 | 0.994565 │
│ 5 | 7.5 | 0.241436 │
└───────┴────────────────┴─────────────┘
Upvotes: 1
Reputation: 2268
If scipy.stats.poisson.cdf
was implemented as a proper numpy universal function, it would be possible to use it directly on polars expressions, but it is not. Fortunately, Poisson CDF is almost the same as regularized upper incomplete gamma function for which scipy supplies gammaincc which can be used in polars expressions:
>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k ┆ cdf │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.018316 │
│ 1 ┆ 0.091578 │
│ 2 ┆ 0.238103 │
│ 3 ┆ 0.43347 │
│ ... ┆ ... │
│ 6 ┆ 0.889326 │
│ 7 ┆ 0.948866 │
│ 8 ┆ 0.978637 │
│ 9 ┆ 0.991868 │
└─────┴──────────┘
The result is the same as returned by poisson.cdf:
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k ┆ cdf ┆ cdf2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.018316 ┆ 0.018316 │
│ 1 ┆ 0.091578 ┆ 0.091578 │
│ 2 ┆ 0.238103 ┆ 0.238103 │
│ 3 ┆ 0.43347 ┆ 0.43347 │
│ ... ┆ ... ┆ ... │
│ 6 ┆ 0.889326 ┆ 0.889326 │
│ 7 ┆ 0.948866 ┆ 0.948866 │
│ 8 ┆ 0.978637 ┆ 0.978637 │
│ 9 ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘
Upvotes: 3