SQL/Snowflake Sampling with specific probability

Question

Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?

Table 1:

Group  Value  Probability
A      1      5%
A      10     5%
A      50     20%
A      30     70% 
B      5      5%
B      25     70%
B      100    25%

A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):

Table 2:

Group  Value  
A      30     
B      25

I'm trying to solve this on Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, as well as using the Uniform function to create random probabilities. Not sure if there's a more elegant way to do a sampling and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value is given a chance of getting selected based on their probabilities.

Felipe Hoffa · Accepted Answer

Give each group a consecutive range. For example, for 15%, the range will be between 30 and 45.
Pick a random number between 0 and 100.
Find in which range that random number falls:

create or replace temp table probs
as 
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30 
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5 
union all select 'b', 2, 7 
union all select 'b', 3, 8
union all select 'b', 4, 80;

with calculated_ranges as (
    select *, range_prob2-prob range_prob1
    from (
        select *, sum(prob) over(partition by id order by prob) range_prob2
        from probs
    )
)

select id, random_draw, value, prob
from (
  select id, any_value(uniform(0, 100, random())) random_draw 
  from probs group by id
) a
join calculated_ranges b
using (id)
where range_prob1<=random_draw and range_prob2>random_draw
;

SQL/Snowflake Sampling with specific probability

Answers (2)

Related Questions