Sticky

Reputation: 161

Undersampling a multi-label DataFrame using pandas

I have a DataFrame like this:

file_name                                                label

../input/image-classification-screening/train/...         1
../input/image-classification-screening/train/...         7
../input/image-classification-screening/train/...         9
../input/image-classification-screening/train/...         9
../input/image-classification-screening/train/...         6

It has 11 classes (0 to 10) with high class imbalance. Below is the output of train['label'].value_counts():

6     6285
3     4139
9     3933
7     3664
2     2778
5     2433
8     2338
0     2166
4     2052
10    1039
1      922

How do I under-sample this data in pandas so that each class has at most 2500 examples? I want to randomly remove data points from the majority classes such as 6, 3, 9, 7 and 2.

Upvotes: 1

Views: 4740

Answers (2)

user7864386

You could create a mask that identifies which labels have at least 2500 rows, then use groupby + sample (setting n=n to draw the required number of rows) on those groups, while selecting all rows of the labels with fewer than 2500 items. This creates two DataFrames, one sampled down to 2500 rows per label and the other kept whole. Then concatenate the two groups using pd.concat:

import pandas as pd

n = 2500
# True for rows whose label occurs at least n times
msk = df.groupby('label')['label'].transform('size') >= n
df = pd.concat((df[msk].groupby('label').sample(n=n), df[~msk]), ignore_index=True)
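
If you need the result to be reproducible, groupby.sample also accepts a random_state argument (a minimal sketch; the seed value 42 is arbitrary):

# fixing the seed makes repeated runs draw the same rows
df = pd.concat((df[msk].groupby('label').sample(n=n, random_state=42), df[~msk]), ignore_index=True)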

For example, if you had a DataFrame like:

df = pd.DataFrame({'ID': range(30),
                   'label': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                             'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 
                             'B', 'B', 'B', 'B', 'C', 'C', 'D', 'F', 'F', 'G']})

and

>>> df['label'].value_counts()

A    13
B    11
C     2
F     2
D     1
G     1
Name: label, dtype: int64

Then the above code with n=3 yields:

    ID label
0    7     A
1    0     A
2   10     A
3   20     B
4   18     B
5   21     B
6   24     C
7   25     C
8   26     D
9   27     F
10  28     F
11  29     G

with

>>> df['label'].value_counts()

A    3
B    3
C    2
F    2
D    1
G    1
Name: label, dtype: int64

Upvotes: 2

Ben.T

Reputation: 29635

You can use sample in a groupby.apply. Here's a reproducible example with 4 unbalanced labels:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'a': range(100),
    'label': np.random.choice(range(4), size=100, p=[0.5, 0.3, 0.18, 0.02])})
print(df['label'].value_counts())
# 0    51
# 1    30
# 2    18
# 3     1
# Name: label, dtype: int64

Now, to select at most 25 rows (replace with 2500 in your case) per label, you do:

nMax = 25  # change to 2500 for your case

# sample nMax rows per group, or keep the whole group if it is smaller
res = df.groupby('label').apply(lambda x: x.sample(n=min(nMax, len(x))))

print(res['label'].value_counts())
# 0    25 # see how label 0 and 1 are now 25
# 1    25
# 2    18 # and the smaller groups stay the same
# 3     1
# Name: label, dtype: int64
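
Note that groupby.apply here returns a DataFrame with a MultiIndex (the group key 'label' plus the original row index), at least in recent pandas versions. A minimal sketch to flatten it back:

# drop the added 'label' index level to recover the original row index
res = res.droplevel('label')

Alternatively, res.reset_index(drop=True) discards the original index entirely.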

Upvotes: 3
