Reputation: 7490
I have a dataset which has around 280K observations. However bulk of that observation belong to one particular value of a column. Here is the count
Category Count
A 8981
B 6146
C 243369
D 10678
F 4431
G 2421
H 1009
I 7948
If you see category C has 243K such record while others have very. I want to down sample the data in such a way that I keep all the records from other category values while only keeping a certain percentage of random records from category 'C'. Like maybe 10% or so. This way I can have proper distribution of data across all these categories.
How to achieve this in postgres?
Upvotes: 1
Views: 2566
Reputation: 121604
The query selects one random row from the table modified in the way that contains only 10% of rows with category C
and all rows with other categories:
select *
from (
(select *
from observations
where category = 'C'
order by random()
limit 243369* 0.1)
union
(select *
from observations
where category <> 'C')
) sub
order by random()
limit 1;
Upvotes: 1