Reputation: 15
I want to perform stratified sampling for each group in SQL. Column ID has 2 levels (each level has 100K observations)and each level is further divided by column2 having levels I and J. I want to extract randomly 100 observations for each ID that is 100 for 'A' and 100 for 'B' and that 100 observations should have 33% of 'I' (randomly) and 67% of 'J' (randomly) for each A and B. I am using postgresql
ID Column2
A J
A J
A I
A I
A I
B J
B I
B I
B I
Upvotes: 0
Views: 3414
Reputation: 1269623
You can use window functions:
select t.*
from (select t.*,
row_number() over (partition by id, column2 order by random()) as seqnum
from t
where id in ('A', 'B') and column2 in ('I', 'J')
) t
where column2 = 'I' and seqnum <= 33 or
column2 = 'J' and seqnum <= 67;
"random" functions differ by database. This uses random()
as a stand-in. The specific function depends on the database you are using.
Upvotes: 2