Pedro_Siqueira

Reputation: 75

How to sample values based on probabilities with Polars like numpy.random.choice?

I'm working with the Polars library in Python and would like to sample values from a list based on associated probabilities, similar to how numpy.random.choice works. Here's what I'd like to achieve:

In numpy, I can do something like this:

import numpy as np

# Possible outcomes
values = [0, 1, 2, 3, 4]

# Associated probabilities
probabilities = [0.15, 0.30, 0.25, 0.20, 0.10]

# Sample a value based on probabilities
sampled_value = np.random.choice(values, p=probabilities)
print(sampled_value)

This returns a random value from values, chosen according to the probabilities in probabilities.

However, in my use case with Polars, the probabilities are in separate columns of a DataFrame, like this:

import polars as pl

# Example DataFrame with probabilities for each outcome per row
df = pl.DataFrame({
    "id": [1, 2, 3],
    "prob_0": [0.1, 0.2, 0.15],
    "prob_1": [0.3, 0.25, 0.35],
    "prob_2": [0.25, 0.3, 0.25],
    "prob_3": [0.2, 0.15, 0.15],
    "prob_4": [0.15, 0.1, 0.1]
})

Now, I'd like to create a new column, say sampled_goal, where each row's value is sampled based on the row-specific probabilities in columns prob_0, prob_1, etc. Here’s how I’d like it to work with Polars’s with_columns function:

# Pseudocode: prob_0 ... prob_4 stand for the per-row values of those columns
df = df.with_columns(
    np.random.choice(a=[0, 1, 2, 3, 4], p=[prob_0, prob_1, prob_2, prob_3, prob_4])
)

Upvotes: 0

Views: 84

Answers (2)

BallpointBen

Reputation: 13867

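We can replicate numpy's functionality in polars. We use the probabilities to construct the cumulative distribution function for each row, draw one uniform random number per row, and take as the sampled index the number of cumulative values that fall below that draw. Note that in the output below the prob_* columns hold the cumulative sums, not the original probabilities.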

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "prob_0": [0.1, 0.2, 0.15],
        "prob_1": [0.3, 0.25, 0.35],
        "prob_2": [0.25, 0.3, 0.25],
        "prob_3": [0.2, 0.15, 0.15],
        "prob_4": [0.15, 0.1, 0.1],
    }
)

df = (
    df.select(
        pl.cum_sum_horizontal("^prob_[0-9]+$"),
        pl.lit(np.random.rand(df.height)).alias("rand"),
    )
    .unnest("cum_sum")
    .with_columns(
        pl.sum_horizontal(pl.col("^prob_[0-9]+$").lt(pl.col("rand"))).alias("index")
    )
)
df
shape: (3, 7)
┌────────┬────────┬────────┬────────┬────────┬──────────┬───────┐
│ prob_0 ┆ prob_1 ┆ prob_2 ┆ prob_3 ┆ prob_4 ┆ rand     ┆ index │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---      ┆ ---   │
│ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64      ┆ u32   │
╞════════╪════════╪════════╪════════╪════════╪══════════╪═══════╡
│ 0.1    ┆ 0.4    ┆ 0.65   ┆ 0.85   ┆ 1.0    ┆ 0.96449  ┆ 4     │
│ 0.2    ┆ 0.45   ┆ 0.75   ┆ 0.9    ┆ 1.0    ┆ 0.021633 ┆ 0     │
│ 0.15   ┆ 0.5    ┆ 0.75   ┆ 0.9    ┆ 1.0    ┆ 0.324224 ┆ 1     │
└────────┴────────┴────────┴────────┴────────┴──────────┴───────┘

Upvotes: 1

mozway

Reputation: 262214

Assuming you have a single row of data that contains only the probability columns, and you want to sample the column names, you could use numpy.random.choice as in your example:

np.random.choice(df.columns, p=df.row(0))

Example output: 'value_3'

If you have many rows, you need to process per row:

cols = df.drop('id').columns

df.with_columns(
    pl.concat_list(pl.exclude('id'))
    .map_elements(lambda x: np.random.choice(cols, p=x))
    .alias('random')
)

If you want a vectorized NumPy approach, you could compute the cumulative sum of the probabilities and perform a 2D vectorized searchsorted (using a searchsorted2d helper, which is not defined in this answer):

tmp = df.drop('id')
cols = np.array(tmp.columns)

cum_prob = np.add.accumulate(tmp.to_numpy(), axis=1)
r = np.random.random(len(cum_prob))

out = df.with_columns(random=cols[searchsorted2d(cum_prob, r)])

Example output:

┌─────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ id  ┆ prob_0 ┆ prob_1 ┆ prob_2 ┆ prob_3 ┆ prob_4 ┆ random │
│ --- ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ i64 ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str    │
╞═════╪════════╪════════╪════════╪════════╪════════╪════════╡
│ 1   ┆ 0.1    ┆ 0.3    ┆ 0.25   ┆ 0.2    ┆ 0.15   ┆ prob_3 │
│ 2   ┆ 0.2    ┆ 0.25   ┆ 0.3    ┆ 0.15   ┆ 0.1    ┆ prob_0 │
│ 3   ┆ 0.15   ┆ 0.35   ┆ 0.25   ┆ 0.15   ┆ 0.1    ┆ prob_1 │
└─────┴────────┴────────┴────────┴────────┴────────┴────────┘
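
The snippet above calls searchsorted2d without defining it. As a stand-in, here is a minimal sketch of a row-wise search using a broadcast comparison; it is not necessarily the implementation the answer had in mind. The random draws are fixed by hand (an assumption, only so the result can be checked).

```python
import numpy as np


def searchsorted2d(cum_prob: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: for each row i, count entries of cum_prob[i] below r[i].

    For rows sorted in ascending order this matches np.searchsorted(row, value)
    with the default side="left", applied row by row.
    """
    return (cum_prob < r[:, None]).sum(axis=1)


# Same probabilities as in the question, one row per id.
probs = np.array(
    [
        [0.1, 0.3, 0.25, 0.2, 0.15],
        [0.2, 0.25, 0.3, 0.15, 0.1],
        [0.15, 0.35, 0.25, 0.15, 0.1],
    ]
)
cols = np.array([f"prob_{i}" for i in range(probs.shape[1])])

cum_prob = np.add.accumulate(probs, axis=1)
r = np.array([0.7, 0.05, 0.3])  # fixed draws instead of np.random.random(3)

idx = searchsorted2d(cum_prob, r)
print(idx)        # [3 0 1]
print(cols[idx])  # ['prob_3' 'prob_0' 'prob_1']
```

The comparison builds a full boolean array of shape (rows, columns), which is fine for a handful of outcome columns but uses more memory than a true binary search when there are many.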

Upvotes: 1
