Reputation: 2491
I have data with the following structure:
CREATE TABLE if not EXISTS scores (
id int,
class char,
score float
);
INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);
I want to randomly sample, for each id, a class. A possible sample obtained would be:
1,'A'
2,'B'
3,'D'
4,'A'
The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:
id = 1
, sampling class 'B' should be twice as likely to be sampled than class 'C'.id = 2
, sampling class 'B' should be twice as likely to be sampled than class 'A'.id = 3
, we should only sample class 'D'.id = 4
, sampling class 'B' should be as likely as sampling class 'A'.I'm looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?
Thanks!
Upvotes: 2
Views: 277
Reputation: 10172
Possible approach is to generate for each 'id-class' pair a number of rows equivalent to the score (50 '1-A' rows, 20 '1-B' rows, 10 '1-C' rows, etc...) and then select 1 row randomly per id.
For BigQuery:
select id, array_agg(class order by rand() limit 1)[offset(0)]
from scores, unnest(generate_array(1, score * 100))
group by id
Upvotes: 2
Reputation: 1269873
If I understand correctly, you have a column that essentially has weights. You want to use these for random sampling, to extract one row for each id
but the likelihood of that row is based on the weight.
The idea is to do the following:
id
.The logic looks like this:
select s.*
from (select s.*,
sum(score) over (partition by id order by class) / sum(score) over (partition by id) as threshold_hi,
(sum(score) over (partition by id order by class) - score) / sum(score) over (partition by id) as threshold_lo
from scores s
) s join
(select i.id, random() as rand
from (select distinct id from scores) i
) i
on i.id = s.id and
i.rand >= s.threshold_lo and i.rand < s.threshold_hi
Here is a db<>fiddle.
Upvotes: 2
Reputation: 1290
In PostgreSQL
i know for 2 ways.
1st Way with DISTINCT
and RANDOM()
:
SELECT DISTINCT ON (id) id, class
FROM scores
ORDER BY id, random();
2nd Way with OVER PARTITION BY
and RANDOM()
:
SELECT id, class
FROM (
SELECT *, row_number() OVER (PARTITION BY id ORDER BY random()) as rn
FROM scores ) sub
WHERE rn = 1;
Both queries you can check on DB<>FIDDLE
Pay attention that if you ran both queries at same time that you will get different records.
Upvotes: 1