vicky
vicky

Reputation: 15

Stratified sampling for each group in sql

I want to perform stratified sampling for each group in SQL. Column ID has 2 levels (each level has 100K observations)and each level is further divided by column2 having levels I and J. I want to extract randomly 100 observations for each ID that is 100 for 'A' and 100 for 'B' and that 100 observations should have 33% of 'I' (randomly) and 67% of 'J' (randomly) for each A and B. I am using postgresql

ID    Column2
A         J
A         J
A         I
A         I
A         I
B         J
B         I
B         I
B         I

Upvotes: 0

Views: 3414

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269623

You can use window functions:

select t.*
from (select t.*,
             row_number() over (partition by id, column2 order by random()) as seqnum
      from t
      where id in ('A', 'B') and column2 in ('I', 'J')
     ) t
where column2 = 'I' and seqnum <= 33 or
      column2 = 'J' and seqnum <= 67;

"random" functions differ by database. This uses random() as a stand-in. The specific function depends on the database you are using.

Upvotes: 2

Related Questions