YohanRoth

Reputation: 3263

Balanced row sample from dataframe with pandas given categorical target column

Given a dataframe, my goal is to sample rows so that the values in one column are as balanced as possible. Say I have the dataframe below; the sample size is 3 and the target column is c:

| a  | b  | c |
|----|----|---|
| 1  | 2  | 0 |
| 3  | 4  | 0 |
| 5  | 6  | 1 |
| 7  | 8  | 2 |
| 9  | 10 | 2 |
| 11 | 12 | 2 |

One possible sample would be:

| a | b | c |
|---|---|---|
| 1 | 2 | 0 |
| 5 | 6 | 1 |
| 7 | 8 | 2 |

If the sample size is not a multiple of the number of unique classes, it is fine for the class counts to differ by one item or so.

How would I approach this in pandas?

EDIT: I posted the solution that worked for me in the answers.

Upvotes: 4

Views: 2189

Answers (5)

Iker

Reputation: 139

First, we create your example dataframe:

import pandas as pd

columns = ['a', 'b', 'c']
data = [[1, 2, 0], [3, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data=data, columns=columns)


Now, the following function does what you want:

def balanced_sample(dataframe, sample_size, target_column):
    # extract the existing classes
    target_column_values = dataframe[target_column].unique().tolist()

    # count the number of classes
    n_classes = len(target_column_values)

    # warn if the sample size is not a multiple of the number of classes
    if sample_size % n_classes != 0:
        print('Sample size is not a multiple of the number of unique classes')

    # round() keeps the per-class difference within one item or so;
    # sample_size // n_classes would also work, but then
    # instances_per_class would always round down
    instances_per_class = round(sample_size / n_classes)

    # check that there are enough examples per class
    values_per_class = dataframe[target_column].value_counts()
    for idx in values_per_class.index:
        if instances_per_class > values_per_class[idx]:
            print('Class {} has only {} examples, so it is impossible to use a '
                  'sample size of {}, i.e., {} per class'.format(
                      idx, values_per_class[idx],
                      sample_size, instances_per_class))
            return pd.DataFrame(columns=dataframe.columns)

    # build the result dataframe
    data = []
    for cls in target_column_values:
        class_values = dataframe[dataframe[target_column] == cls] \
            .sample(instances_per_class).values.tolist()
        data += class_values
    result_dataframe = pd.DataFrame(columns=dataframe.columns, data=data)
    return result_dataframe

Now we check the function on the example dataframe, and then with other sample sizes (the original outputs were screenshots).
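A runnable check along these lines (a sketch; the function is restated compactly so the snippet stands on its own):

```python
import pandas as pd

def balanced_sample(dataframe, sample_size, target_column):
    # same logic as the function above, condensed
    classes = dataframe[target_column].unique().tolist()
    per_class = round(sample_size / len(classes))
    counts = dataframe[target_column].value_counts()
    if any(per_class > counts[c] for c in classes):
        return pd.DataFrame(columns=dataframe.columns)
    parts = [dataframe[dataframe[target_column] == c].sample(per_class)
             for c in classes]
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame([[1, 2, 0], [3, 4, 0], [5, 6, 1],
                   [7, 8, 2], [9, 10, 2], [11, 12, 2]],
                  columns=['a', 'b', 'c'])
result = balanced_sample(df, 3, 'c')
print(result['c'].value_counts())  # one row from each class
```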

I hope you find it useful; if you have any doubts, comment here and I will try to answer.

Upvotes: 1

João Victor

Reputation: 435

You can just take a random sample of each group of the target column, using its minimum class count as the sample size.

column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
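On the question's example frame this draws one row per class, since class 1 occurs only once (a sketch; `groupby(...).sample` requires pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})

column = 'c'
n = df[column].value_counts().min()  # 1: class 1 has a single row
sample = df.groupby(column).sample(n=n, random_state=42)
print(sample)  # one row for each of the classes 0, 1, 2
```

Note that the total sample size here is the minimum class count times the number of classes, so you cannot request an arbitrary total.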

Upvotes: 2

YohanRoth

Reputation: 3263

I am posting the solution that works for me. It is not the most beautiful or efficient code. But that's honest work.

import pandas as pd

df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()
k = 8  # sample size
per_class_sample_size = k // unique_values.shape[0]
arr_samples_per_class = [0] * len(unique_values)
leftover = k - (per_class_sample_size * len(unique_values))

# hand the leftover rows, one each, to classes that can spare them
for i, v in enumerate(unique_values):
    occ = df[df[target_col] == v].shape[0]
    if leftover > 0 and occ > per_class_sample_size:
        sz = per_class_sample_size + 1
        leftover -= 1
    else:
        sz = per_class_sample_size if occ >= per_class_sample_size else occ
    arr_samples_per_class[i] = sz

# draw each class's sample and stack them into the final frame
fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
    ss = df.loc[df[target_col] == v].sample(sz)
    fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)
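Wrapped in a function and run on the question's example data, this can be checked end to end (a sketch; `balanced_take` is a hypothetical name, the logic is the same as above):

```python
import pandas as pd

def balanced_take(df, target_col, k):
    # same leftover-distribution idea as the snippet above
    unique_values = df[target_col].unique()
    per_class = k // len(unique_values)
    leftover = k - per_class * len(unique_values)
    parts = []
    for v in unique_values:
        occ = (df[target_col] == v).sum()
        if leftover > 0 and occ > per_class:
            sz = per_class + 1
            leftover -= 1
        else:
            sz = min(per_class, occ)
        parts.append(df.loc[df[target_col] == v].sample(sz))
    return pd.concat(parts, axis=0)

df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})
out = balanced_take(df, 'c', 4)  # 4 rows, class counts differ by at most 1
```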

Upvotes: 0

Mayowa Ayodele

Reputation: 559

I first generated sample sizes for each unique value of column c so that the result is balanced. The remainder is distributed over the first few classes.

unique_values = df['c'].unique()
sample_sizes = [k // len(unique_values)] * len(unique_values)
i = 0
while i < k % len(unique_values):
    sample_sizes[i] += 1
    i = i + 1

This bit generates the samples based on the generated sample sizes

df2 = pd.concat([df.loc[df['c'] == unique_values[i]].sample(sample_sizes[i]) for i in range(len(sample_sizes))])
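Put together on the question's data (a sketch, assuming `k = 4`), the generated sizes come out as `[2, 1, 1]` and the concatenated result is balanced to within one row per class:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})
k = 4

unique_values = df['c'].unique()
sample_sizes = [k // len(unique_values)] * len(unique_values)
for i in range(k % len(unique_values)):
    sample_sizes[i] += 1  # remainder goes to the first classes

df2 = pd.concat([df.loc[df['c'] == unique_values[i]].sample(sample_sizes[i])
                 for i in range(len(sample_sizes))])
```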

Upvotes: 2

Steven G

Reputation: 17152

The question is a bit ambiguous, but let's say you want to randomly select one row for each category of column c. One could do:

import pandas as pd

data = [
    [1, 2, 0], [1, 4, 0], [2, 2, 1],
    [4, 5, 1], [3, 7, 2], [3, 3, 2],
    [1, 2, 6], [3, 2, 6],  [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())

   a  b  c
c         
0  1  4  0
1  2  2  1
2  3  3  2
6  1  2  6
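As the output shows, the c values end up both in the group index and in the column. If that duplication is unwanted, a flat-index variant (a sketch, using the `groupby(...).sample` approach from another answer) is:

```python
import pandas as pd

data = [
    [1, 2, 0], [1, 4, 0], [2, 2, 1],
    [4, 5, 1], [3, 7, 2], [3, 3, 2],
    [1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# one random row per class, keeping a plain RangeIndex
sample = df.groupby('c').sample(n=1).reset_index(drop=True)
```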

Upvotes: 0
