Reputation: 59
I have a table that looks like the following:
user_id | target |
---|---|
1278 | 1 |
9809 | 0 |
3345 | 0 |
9800 | 0 |
1298 | 1 |
1223 | 0 |
My goal is to perform undersampling which means that I want to randomly select a subset of users that have a target of 0 while keeping all users that have a target of 1 value. I have tried the following code, however, since the user_ids are all unique, it doesn't remove the rows with the target of 0 randomly. Any idea what I need to do?
select *
from (select user_id, target, row_number() over (partition by user_id, target order by rand()) as seq
from dataset.mytable
) a
where target = 1 or seq = 1
Upvotes: 0
Views: 300
Reputation: 172994
Consider below approach - it leaves all target=1 rows and ~50% of target=0 rows
select *
from `dataset.mytable`
where if(target = 1, true, rand() < 0.5)
Upvotes: 1
Reputation: 1269623
One method uses window functions:
select t.* except (seqnum, cnt1)
from (select t.*,
row_number() over (partition by target order by rand()) as seqnum,
countif(target = 1) over () as cnt1
from t
) t
where seqnum <= cnt1;
The above might have performance problems -- or even exceed resources because of the large volume of data being sorted. An approximate method might also work for your purposes:
select t.* except (cnt, cnt1)
from (select t.*,
count(*) over (partition by target) as cnt,
countif(target = 1) over () as cnt1
from t
) t
where rand() < cnt * 1.0 / cnt1;
This is not guaranteed to produce exactly the same numbers of 0
and 1
, but the numbers will be quite close.
Upvotes: 1