user16228179
user16228179

Reputation: 59

SQL: perform undersampling to select a subset of majority class

I have a table that looks like the following:

user_id target
1278 1
9809 0
3345 0
9800 0
1298 1
1223 0

My goal is to perform undersampling which means that I want to randomly select a subset of users that have a target of 0 while keeping all users that have a target of 1 value. I have tried the following code, however, since the user_ids are all unique, it doesn't remove the rows with the target of 0 randomly. Any idea what I need to do?

select *
from (select user_id, target, row_number() over (partition by user_id, target order by rand()) as seq
from dataset.mytable
) a
where target = 1 or seq = 1


Upvotes: 0

Views: 300

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Consider below approach - it leaves all target=1 rows and ~50% of target=0 rows

select * 
from `dataset.mytable`
where if(target = 1, true, rand() < 0.5)

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269623

One method uses window functions:

select t.* except (seqnum, cnt1)
from (select t.*,
             row_number() over (partition by target order by rand()) as seqnum,
             countif(target = 1) over () as cnt1
      from t
     ) t
where seqnum <= cnt1;

The above might have performance problems -- or even exceed resources because of the large volume of data being sorted. An approximate method might also work for your purposes:

select t.* except (cnt, cnt1)
from (select t.*,
             count(*) over (partition by target) as cnt,
             countif(target = 1) over () as cnt1
      from t
     ) t
where rand() < cnt * 1.0 / cnt1;

This is not guaranteed to produce exactly the same numbers of 0 and 1, but the numbers will be quite close.

Upvotes: 1

Related Questions