Stratified Sampling based on a column category in postgres

Question

I have a dataset which has around 280K observations. However bulk of that observation belong to one particular value of a column. Here is the count

Category    Count
A   8981
B   6146
C   243369
D   10678
F   4431
G   2421
H   1009
I   7948

If you see category C has 243K such record while others have very. I want to down sample the data in such a way that I keep all the records from other category values while only keeping a certain percentage of random records from category 'C'. Like maybe 10% or so. This way I can have proper distribution of data across all these categories.

How to achieve this in postgres?

klin · Accepted Answer

The query selects one random row from the table modified in the way that contains only 10% of rows with category C and all rows with other categories:

select *
from (
    (select *
    from observations
    where category = 'C'
    order by random()
    limit 243369* 0.1)
    union
    (select *
    from observations
    where category <> 'C')
    ) sub
order by random()
limit 1;

Stratified Sampling based on a column category in postgres

Answers (1)

Related Questions