Reputation: 33
Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?
Table 1:
Group Value Probability
A 1 5%
A 10 5%
A 50 20%
A 30 70%
B 5 5%
B 25 70%
B 100 25%
A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):
Table 2:
Group Value
A 30
B 25
I'm trying to solve this on Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, as well as using the Uniform function to create random probabilities. Not sure if there's a more elegant way to do a sampling and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value is given a chance of getting selected based on their probabilities.
Upvotes: 2
Views: 987
Reputation: 33
Felipe's answer is great, it definitely solved the problem.
While trying out different approaches yesterday, I tested out this approach on Felipe's table and it seems to be working as well.
I'm giving each record a random probability and comparing against the actual probability. The idea is that if the random probability is less than or equal to the actual probability, then it's accepted and the partitioning will do the deduplication based on a descending order with the probabilities.
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
create or replace temp table t2 as
select *,
min(compare_prob) over(partition by id) as min_compare_prob,
max(compare_prob) over(partition by id) as max_compare_prob,
min_compare_prob <> max_compare_prob as not_all_identical --min_rank2 <> max_rank2 checks if all records (by group) have different values
from (select id,
value,
prob,
UNIFORM(0.00001::float,1::float,random(2)) as rand_prob, --random probability
case when prob >= rand_prob then 1 else 0 end as compare_prob
from (select id, value, prob/100 as prob from probs)
);
--dedeup results
select id, value, prob, rand_prob
from (select *,
row_number() over(partition by id order by prob desc, rand_prob desc) as rn
from t2
where not_all_identical = FALSE
union all
select *,
row_number() over(partition by id order by prob desc, COMPARE_PROB desc) as rn
from t2
where not_all_identical = TRUE)
where rn = 1;
Upvotes: 0
Reputation: 59225
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
with calculated_ranges as (
select *, range_prob2-prob range_prob1
from (
select *, sum(prob) over(partition by id order by prob) range_prob2
from probs
)
)
select id, random_draw, value, prob
from (
select id, any_value(uniform(0, 100, random())) random_draw
from probs group by id
) a
join calculated_ranges b
using (id)
where range_prob1<=random_draw and range_prob2>random_draw
;
Upvotes: 1