Reputation: 1113
I have a polars dataframe illustrated as follows.
import polars as pl
df = pl.DataFrame(
{
"a": [1, 4, 3, 2, 8, 4, 5, 6],
"b": [2, 3, 1, 3, 9, 7, 6, 8],
"c": [1, 1, 1, 1, 2, 2, 2, 2],
}
)
The task I have is: group the dataframe by column c, and within each group, if every value in a is smaller than its corresponding value in b, keep a as-is; otherwise, replace a with the result of a custom function convert(a, b).
So, for the example above, the output I want is as follows (exploded after the groupby for better illustration).
shape: (8, 2)
┌─────┬─────┐
│ c ┆ a │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
│ 1 ┆ 3 │
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 2 ┆ 8 │
│ 2 ┆ 4 │
│ 2 ┆ 5 │
│ 2 ┆ 6 │
└─────┴─────┘
With the assumption that convert behaves like this:
>>> import numpy as np
>>> convert(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3]))
array([1, 3, 1, 2])
# [1, 4, 3, 2] is from column a of df when column c is 1, and [2, 3, 1, 3] comes from column b of df when column c is 1.
# I have to apply my custom python function 'convert' for the c == 1 group, because not all values in a are smaller than those in b according to the task description above.
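For concreteness, convert could behave like an element-wise minimum, which happens to match the sample output above; in reality it is an opaque third-party function with this signature:
import numpy as np

def convert(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in only: np.minimum reproduces the example,
    # convert(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3])) -> [1, 3, 1, 2]
    return np.minimum(a, b)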
My question is: how am I supposed to implement this logic in a performant, polars-idiomatic way, without sacrificing too much of the speed gained from running Rust code in parallel?
I ask because, from my understanding, using apply with a custom Python function slows the program down, but in my case only some groups actually need the third-party function. So, is there any way to get the best of both worlds: keep the full benefits of polars for groups where no third-party function is required, and only pay the cost of apply where it is actually needed?
Upvotes: 1
Views: 1731
Reputation: 21249
It sounds like you want to find matching groups:
(
df
.with_row_count()
.filter(
(pl.col("a") >= pl.col("b"))
.any()
.over("c"))
)
shape: (4, 4)
┌────────┬─────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   ┆ c   │
│ ---    ┆ --- ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   │
│ 1      ┆ 4   ┆ 3   ┆ 1   │
│ 2      ┆ 3   ┆ 1   ┆ 1   │
│ 3      ┆ 2   ┆ 3   ┆ 1   │
└────────┴─────┴─────┴─────┘
And apply your custom function over each group.
(
df
.with_row_count()
.filter(
(pl.col("a") >= pl.col("b"))
.any()
.over("c"))
.select(
pl.col("row_nr"),
pl.apply(
["a", "b"], # np.minimum is just for example purposes
lambda s: np.minimum(s[0], s[1]))
.over("c"))
)
shape: (4, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ i64 │
╞════════╪═════╡
│ 0      ┆ 1   │
│ 1      ┆ 3   │
│ 2      ┆ 1   │
│ 3      ┆ 2   │
└────────┴─────┘
(Note: there may be some useful information in How to Write Poisson CDF as Python Polars Expression with regards to scipy/numpy ufuncs and potentially avoiding .apply().)
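For example, numpy ufuncs dispatch on polars expressions (np.log(pl.col("a")) builds an expression rather than evaluating eagerly), and element-wise logic can often be written natively. A sketch assuming the np.minimum stand-in, using pl.min_horizontal (available in recent polars versions) so no Python lambda is needed:
(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
    .select(
        pl.col("row_nr"),
        # element-wise minimum runs entirely inside the expression engine
        pl.min_horizontal("a", "b").alias("a"))
)
Note that the .over("c") window is no longer needed on the select: an element-wise minimum does not depend on group boundaries.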
You can then .join() the result back into the original data.
(
df
.with_row_count()
.join(
df
.with_row_count()
.filter(
(pl.col("a") >= pl.col("b"))
.any()
.over("c"))
.select(
pl.col("row_nr"),
pl.apply(
["a", "b"],
lambda s: np.minimum(s[0], s[1]))
.over("c")),
on="row_nr",
how="left")
)
shape: (8, 5)
┌────────┬─────┬─────┬─────┬─────────┐
│ row_nr ┆ a   ┆ b   ┆ c   ┆ a_right │
│ ---    ┆ --- ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ i64 ┆ i64 ┆ i64 ┆ i64     │
╞════════╪═════╪═════╪═════╪═════════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   ┆ 1       │
│ 1      ┆ 4   ┆ 3   ┆ 1   ┆ 3       │
│ 2      ┆ 3   ┆ 1   ┆ 1   ┆ 1       │
│ 3      ┆ 2   ┆ 3   ┆ 1   ┆ 2       │
│ 4      ┆ 8   ┆ 9   ┆ 2   ┆ null    │
│ 5      ┆ 4   ┆ 7   ┆ 2   ┆ null    │
│ 6      ┆ 5   ┆ 6   ┆ 2   ┆ null    │
│ 7      ┆ 6   ┆ 8   ┆ 2   ┆ null    │
└────────┴─────┴─────┴─────┴─────────┘
You can then fill in the nulls.
.with_columns(
pl.col("a_right").fill_null(pl.col("a")))
Upvotes: 4