Reputation: 75
I'm working with the Polars library in Python and would like to sample values from a list based on associated probabilities, similar to how numpy.random.choice works. Here's what I'd like to achieve:
In numpy, I can do something like this:
import numpy as np
# Possible outcomes
values = [0, 1, 2, 3, 4]
# Associated probabilities
probabilities = [0.15, 0.30, 0.25, 0.20, 0.10]
# Sample a value based on probabilities
sampled_value = np.random.choice(values, p=probabilities)
print(sampled_value)
This returns a random value from values, chosen according to the probabilities in probabilities.
However, in my use case with Polars, the probabilities are in separate columns of a DataFrame, like this:
import polars as pl
# Example DataFrame with probabilities for each outcome per row
df = pl.DataFrame({
    "id": [1, 2, 3],
    "prob_0": [0.1, 0.2, 0.15],
    "prob_1": [0.3, 0.25, 0.35],
    "prob_2": [0.25, 0.3, 0.25],
    "prob_3": [0.2, 0.15, 0.15],
    "prob_4": [0.15, 0.1, 0.1],
})
Now, I'd like to create a new column, say sampled_goal, where each row's value is sampled based on the row-specific probabilities in columns prob_0, prob_1, etc. Conceptually, here's how I'd like it to work with Polars's with_columns function:
df = df.with_columns(
    sampled_goal=np.random.choice(a=[0, 1, 2, 3, 4], p=[prob_0, prob_1, prob_2, prob_3, prob_4])
)
Upvotes: 0
Views: 84
Reputation: 13867
We can replicate numpy's functionality natively in Polars. We use the probabilities to construct the cumulative distribution function (CDF), then draw one uniform random number per row and count how many cumulative probabilities fall below it; that count is the sampled index.
import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "prob_0": [0.1, 0.2, 0.15],
        "prob_1": [0.3, 0.25, 0.35],
        "prob_2": [0.25, 0.3, 0.25],
        "prob_3": [0.2, 0.15, 0.15],
        "prob_4": [0.15, 0.1, 0.1],
    }
)
df = (
    df.select(
        pl.cum_sum_horizontal("^prob_[0-9]+$"),
        pl.lit(np.random.rand(df.height)).alias("rand"),
    )
    .unnest("cum_sum")
    .with_columns(
        pl.sum_horizontal(pl.col("^prob_[0-9]+$").lt(pl.col("rand"))).alias("index")
    )
)
df
shape: (3, 7)
┌────────┬────────┬────────┬────────┬────────┬──────────┬───────┐
│ prob_0 ┆ prob_1 ┆ prob_2 ┆ prob_3 ┆ prob_4 ┆ rand ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ u32 │
╞════════╪════════╪════════╪════════╪════════╪══════════╪═══════╡
│ 0.1 ┆ 0.4 ┆ 0.65 ┆ 0.85 ┆ 1.0 ┆ 0.96449 ┆ 4 │
│ 0.2 ┆ 0.45 ┆ 0.75 ┆ 0.9 ┆ 1.0 ┆ 0.021633 ┆ 0 │
│ 0.15 ┆ 0.5 ┆ 0.75 ┆ 0.9 ┆ 1.0 ┆ 0.324224 ┆ 1 │
└────────┴────────┴────────┴────────┴────────┴──────────┴───────┘
Upvotes: 1
Reputation: 262214
Assuming you have a single row of data and want to sample the column names, you could use numpy.random.choice like in your example (dropping the id column so the probabilities line up):
np.random.choice(df.drop('id').columns, p=df.drop('id').row(0))
Example output: 'prob_3'
If you have many rows, you need to process per row:
cols = df.drop('id').columns
df.with_columns(
    pl.concat_list(pl.exclude('id'))
    .map_elements(lambda x: np.random.choice(cols, p=x))
    .alias('random')
)
If you want to go with numpy, you could compute the cumulative sum of the probabilities and perform a 2D vectorized searchsorted (searchsorted2d
as shown here):
tmp = df.drop('id')
cols = np.array(tmp.columns)
cum_prob = np.add.accumulate(tmp.to_numpy(), axis=1)
r = np.random.random(len(cum_prob))
out = df.with_columns(random=cols[searchsorted2d(cum_prob, r)])
Example output:
┌─────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ id ┆ prob_0 ┆ prob_1 ┆ prob_2 ┆ prob_3 ┆ prob_4 ┆ random │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞═════╪════════╪════════╪════════╪════════╪════════╪════════╡
│ 1 ┆ 0.1 ┆ 0.3 ┆ 0.25 ┆ 0.2 ┆ 0.15 ┆ prob_3 │
│ 2 ┆ 0.2 ┆ 0.25 ┆ 0.3 ┆ 0.15 ┆ 0.1 ┆ prob_0 │
│ 3 ┆ 0.15 ┆ 0.35 ┆ 0.25 ┆ 0.15 ┆ 0.1 ┆ prob_1 │
└─────┴────────┴────────┴────────┴────────┴────────┴────────┘
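For completeness, when there is exactly one random draw per row (as above), searchsorted2d can be written very simply: count, per row, how many cumulative probabilities fall strictly below the draw. This is a sketch under that assumption, not the general offset-based version the link describes:

```python
import numpy as np

def searchsorted2d(a, v):
    # a: (m, n) array, each row sorted ascending (cumulative probabilities)
    # v: (m,) array, one query value per row
    # Returns, for each row i, the first index j where a[i, j] >= v[i],
    # matching np.searchsorted(a[i], v[i]) applied row by row.
    return (a < v[:, None]).sum(axis=1)

cum_prob = np.array([
    [0.1, 0.4, 0.65, 0.85, 1.0],
    [0.2, 0.45, 0.75, 0.9, 1.0],
])
r = np.array([0.96, 0.3])
print(searchsorted2d(cum_prob, r))  # -> [4 1]
```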
Upvotes: 1