David Masip
David Masip

Reputation: 2491

Sample from groups proportional to score

I have data with the following structure:

CREATE TABLE if not EXISTS scores (
  id int,
  class char,
  score float
);

INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);

I want to randomly sample, for each id, a class. A possible sample obtained would be:

1,'A'
2,'B'
3,'D'
4,'A'

The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:

I'm looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?

Thanks!

Upvotes: 2

Views: 277

Answers (3)

Sergey Geron
Sergey Geron

Reputation: 10172

Possible approach is to generate for each 'id-class' pair a number of rows equivalent to the score (50 '1-A' rows, 20 '1-B' rows, 10 '1-C' rows, etc...) and then select 1 row randomly per id.

For BigQuery:

select id, array_agg(class order by rand() limit 1)[offset(0)]
from scores, unnest(generate_array(1, score * 100))
group by id

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269873

If I understand correctly, you have a column that essentially has weights. You want to use these for random sampling, to extract one row for each id but the likelihood of that row is based on the weight.

The idea is to do the following:

  • Normalize the weights to be a range between 0 and 1. You do this by using a cumulative sum and division.
  • Choose one random number per id.
  • Compare the two.

The logic looks like this:

select s.*
from (select s.*, 
             sum(score) over (partition by id order by class) / sum(score) over (partition by id) as threshold_hi,
             (sum(score) over (partition by id order by class) - score) / sum(score) over (partition by id) as threshold_lo
      from scores s
     ) s join
     (select i.id, random() as rand
      from (select distinct id from scores) i
     ) i
     on i.id = s.id and
        i.rand >= s.threshold_lo and i.rand < s.threshold_hi

Here is a db<>fiddle.

Upvotes: 2

Marko Ivkovic
Marko Ivkovic

Reputation: 1290

In PostgreSQL i know for 2 ways.

1st Way with DISTINCT and RANDOM():

SELECT DISTINCT ON (id) id, class
FROM scores 
ORDER BY id, random();

2nd Way with OVER PARTITION BY and RANDOM():

SELECT id, class 
FROM (
SELECT *, row_number() OVER (PARTITION BY id ORDER BY random()) as rn
FROM scores ) sub
WHERE rn = 1;

Both queries you can check on DB<>FIDDLE

Pay attention that if you ran both queries at same time that you will get different records.

Upvotes: 1

Related Questions