Shlomi Schwartz

Reputation: 8913

Python - Pandas, Resample dataset to have balanced classes

Given the following data frame, which has only two possible labels:

   name  f1  f2  label
0     A   8   9      1
1     A   5   3      1
2     B   8   9      0
3     C   9   2      0
4     C   8   1      0
5     C   9   1      0
6     D   2   1      0
7     D   9   7      0
8     D   3   1      0
9     E   5   1      1
10    E   3   6      1
11    E   7   1      1

I've written code that groups the data by the 'name' column and pivots the result into a NumPy array, so that each row holds all the samples of one group (zero-padded to the length of the longest group), and the labels go into a second NumPy array:


[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
[[2 1] [9 7] [3 1]] # D label = 0
[[5 1] [3 6] [7 1]] # E label = 1




import pandas as pd
import numpy as np

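def prepare_data(group_name):
    df = pd.read_csv("../data/tmp.csv")

    # Number the rows within each group, then pad every group to the same length with zeros
    group_index = df.groupby(group_name).cumcount()
    data = (df.set_index([group_name, group_index])
            .unstack(fill_value=0).stack())

    # One label per group, taken from the group's first row
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    # Drop the label column and collect each group's samples into one row
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())
    return data, target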


I would like to resample and delete instances from the over-represented class.


[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
# group D was deleted randomly from the '0' labels
[[5 1] [3 6] [7 1]] # E label = 1

would be an acceptable result, since removing D (labeled '0') leaves a balanced dataset with two groups labeled '1' and two groups labeled '0'.
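To make the goal concrete, here is a minimal sketch of the desired behaviour applied directly to the two arrays. It assumes `data` has shape (groups, samples, features) and `target` has shape (groups, 1), as in the output above. The helper name `undersample_groups` and its `seed` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def undersample_groups(data, target, seed=None):
    """Randomly drop whole groups until every label has as many groups as the rarest label."""
    rng = np.random.default_rng(seed)
    labels = target.ravel()
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = int(counts.min())  # number of groups to keep per label
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original group order
    return data[keep], target[keep]

# The padded arrays from the question (groups A to E).
data = np.array([
    [[8, 9], [5, 3], [0, 0]],  # A
    [[8, 9], [0, 0], [0, 0]],  # B
    [[9, 2], [8, 1], [9, 1]],  # C
    [[2, 1], [9, 7], [3, 1]],  # D
    [[5, 1], [3, 6], [7, 1]],  # E
])
target = np.array([[1], [0], [0], [0], [1]])

balanced_data, balanced_target = undersample_groups(data, target, seed=0)
print(balanced_data.shape)      # (4, 3, 2): one of the three '0' groups is gone
print(balanced_target.ravel())  # [1 0 0 1]
```

Which of B, C or D gets dropped depends on the seed; the two '1' groups are always kept because they are the minority.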

Upvotes: 8

Views: 19652

Answers (5)


Reputation: 1426

Using imbalanced-learn (pip install imbalanced-learn), this is as simple as:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='not minority', random_state=1)
df_balanced, balanced_labels = rus.fit_resample(df, df['label'])

There are many methods other than RandomUnderSampler, so I suggest you read the documentation.

Upvotes: 3


Reputation: 1473

You can make use of a grouped representation for resampling.

def balance_df(frame: pd.DataFrame, col: str, upsample_minority: bool):
    grouped = frame.groupby(col)
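    # Target size per class: the largest class when upsampling, the smallest when downsampling
    n_samp = {
        True: grouped.size().max(),
        False: grouped.size().min(),
    }[upsample_minority]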

    fun = lambda x: x.sample(n_samp, replace=upsample_minority)
    balanced = grouped.apply(fun)
    balanced = balanced.reset_index(drop=True)
    return balanced
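For comparison, pandas 1.1 and later offer `DataFrameGroupBy.sample`, which expresses the same idea without `apply`. The following is a minimal sketch assuming that method is available; the function name `balance_df_sample` and the `seed` parameter are hypothetical. Like the function above, it balances individual rows, not whole `name` groups.

```python
import pandas as pd

def balance_df_sample(frame, col, upsample_minority, seed=None):
    """Resample every class in `col` to the same number of rows."""
    sizes = frame[col].value_counts()
    # Largest class size when upsampling, smallest when downsampling.
    n_samp = int(sizes.max() if upsample_minority else sizes.min())
    balanced = frame.groupby(col).sample(
        n=n_samp, replace=upsample_minority, random_state=seed
    )
    return balanced.reset_index(drop=True)

# The data frame from the question.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

down = balance_df_sample(df, "label", upsample_minority=False, seed=0)
up = balance_df_sample(df, "label", upsample_minority=True, seed=0)
print(down["label"].value_counts().to_dict())  # 5 rows per label
print(up["label"].value_counts().to_dict())    # 7 rows per label
```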

Upvotes: 1

Ashwin Geet D'Sa

Reputation: 7369

You can also downsample the majority class to the size of the minority class:

### Separate the majority and minority classes
df_minority = df[df['label']==1]
df_majority = df[df['label']==0]

### Now downsample the majority class to the number of samples in the minority class

df_majority = df_majority.sample(len(df_minority), random_state=0)

### concat the majority and minority dataframes
df = pd.concat([df_majority,df_minority])

## Shuffle the dataset to prevent the model from getting biased by similar samples
df = df.sample(frac=1, random_state=0)

Upvotes: 1


Reputation: 654

A very simple approach. Taken from sklearn documentation and Kaggle.

from sklearn.utils import resample

df_majority = df[df.label==0]
df_minority = df[df.label==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
print(df_upsampled.label.value_counts())

Upvotes: 11


Reputation: 36249

Provided that each name carries exactly one label (e.g. all A are 1), you can use the following approach:

  1. Group the names by label and check which label has an excess (in terms of unique names).
  2. Randomly remove names from the over-represented label class in order to account for the excess.
  3. Select the part of the data frame which does not contain the removed names.

Here is the code:

labels = df.groupby('label').name.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
excess = len(labels.iloc[0]) - len(labels.iloc[1])
remove = np.random.choice(labels.iloc[0], excess, replace=False)
# Keep only the rows whose name was not selected for removal.
df2 = df[~df.name.isin(remove)]
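The three steps above can be sketched end to end on the question's data. This is only an illustration under the same assumption (each name carries exactly one label); the seeded `rng` is an addition for reproducibility and not part of the answer's code.

```python
import numpy as np
import pandas as pd

# The data frame from the question.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Step 1: unique names per label, with the over-represented label first.
names = df.groupby("label")["name"].unique()
names = names.loc[names.apply(len).sort_values(ascending=False).index]

# Step 2: randomly pick as many names as the excess.
excess = len(names.iloc[0]) - len(names.iloc[1])
remove = rng.choice(list(names.iloc[0]), size=excess, replace=False)

# Step 3: keep only the rows whose name was not picked.
df2 = df[~df["name"].isin(remove)]

print(df2.groupby("label")["name"].nunique().to_dict())  # 2 names per label
```

Here exactly one of B, C or D is removed, so the row count of `df2` depends on which name is drawn, while the number of names per label is always balanced.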

Upvotes: 4
