K.Y
K.Y

Reputation: 33

SQL/Snowflake Sampling with specific probability

Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?

Table 1:

Group  Value  Probability
A      1      5%
A      10     5%
A      50     20%
A      30     70% 
B      5      5%
B      25     70%
B      100    25%

A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):

Table 2:

Group  Value  
A      30     
B      25  

I'm trying to solve this on Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, as well as using the Uniform function to create random probabilities. Not sure if there's a more elegant way to do a sampling and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value is given a chance of getting selected based on their probabilities.

Upvotes: 2

Views: 987

Answers (2)

K.Y
K.Y

Reputation: 33

Felipe's answer is great, it definitely solved the problem.

While trying out different approaches yesterday, I tested out this approach on Felipe's table and it seems to be working as well.

I'm giving each record a random probability and comparing against the actual probability. The idea is that if the random probability is less than or equal to the actual probability, then it's accepted and the partitioning will do the deduplication based on a descending order with the probabilities.

create or replace temp table probs
as 
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30 
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5 
union all select 'b', 2, 7 
union all select 'b', 3, 8
union all select 'b', 4, 80;

create or replace temp table t2 as 
select *,
       min(compare_prob) over(partition by id) as min_compare_prob,       
       max(compare_prob) over(partition by id) as max_compare_prob,
       min_compare_prob <> max_compare_prob as not_all_identical          --min_rank2 <> max_rank2 checks if all records (by group) have different values
from (select id,
             value,
             prob,
             UNIFORM(0.00001::float,1::float,random(2)) as rand_prob,     --random probability
             case when prob >= rand_prob then 1 else 0 end as compare_prob                                                         
      from (select id, value, prob/100 as prob from probs) 
     );

--dedeup results
select id, value, prob, rand_prob
from (select *,
             row_number() over(partition by id order by prob desc, rand_prob desc) as rn
      from t2
      where not_all_identical = FALSE
      union all
      select *,
             row_number() over(partition by id order by prob desc, COMPARE_PROB desc) as rn
      from t2
      where not_all_identical = TRUE)
where rn = 1;

Upvotes: 0

Felipe Hoffa
Felipe Hoffa

Reputation: 59225

  • Give each group a consecutive range. For example, for 15%, the range will be between 30 and 45.
  • Pick a random number between 0 and 100.
  • Find in which range that random number falls:
create or replace temp table probs
as 
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30 
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5 
union all select 'b', 2, 7 
union all select 'b', 3, 8
union all select 'b', 4, 80;

with calculated_ranges as (
    select *, range_prob2-prob range_prob1
    from (
        select *, sum(prob) over(partition by id order by prob) range_prob2
        from probs
    )
)

select id, random_draw, value, prob
from (
  select id, any_value(uniform(0, 100, random())) random_draw 
  from probs group by id
) a
join calculated_ranges b
using (id)
where range_prob1<=random_draw and range_prob2>random_draw
;

enter image description here

Upvotes: 1

Related Questions