Reputation: 3263
Given a dataframe, my goal is to sample rows such that the values in one column are as balanced as possible.
Say I have the dataframe below, the sample size is 3, and the target column is c:
a  | b  | c
1  | 2  | 0
3  | 4  | 0
5  | 6  | 1
7  | 8  | 2
9  | 10 | 2
11 | 12 | 2
One possible sample would be
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
In case the sample size is not a multiple of the number of unique classes, a difference of one item or so is fine.
How would I approach this in pandas?
EDIT: I posted the solution that worked for me in the answers.
Upvotes: 4
Views: 2189
Reputation: 139
First, we create your example dataframe:
import pandas as pd

columns = ['a', 'b', 'c']
data = [[1, 2, 0], [3, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data=data, columns=columns)
Now, with the following function you can do what you want:
def balanced_sample(dataframe, sample_size, target_column):
    # extract the existing classes
    target_columns_values = dataframe.loc[:, target_column].unique().tolist()
    # count the number of classes
    target_columns_unique_classes_size = len(target_columns_values)
    # check whether the sample size is a multiple of the number of classes
    if sample_size % target_columns_unique_classes_size != 0:
        print('Sample size is not a multiple of the number of unique classes')
    # to allow a difference of one item or so
    instances_per_class = round(sample_size / target_columns_unique_classes_size)
    # another possibility is to use
    # sample_size // target_columns_unique_classes_size instead of round(...),
    # but then instances_per_class will always be <=
    # sample_size / target_columns_unique_classes_size
    # check whether there are enough examples per class
    values_per_class = dataframe.loc[:, target_column].value_counts()
    for idx in values_per_class.index:
        if instances_per_class > values_per_class[idx]:
            print('Class {} has only {} examples, so it is impossible to use a '
                  'sample size of {}, i.e., {} per class'.format(
                      idx, values_per_class[idx], sample_size, instances_per_class))
            return pd.DataFrame(columns=dataframe.columns)
    # creating the result dataframe
    data = []
    for classes in target_columns_values:
        class_values = dataframe[dataframe.loc[:, target_column] ==
                                 classes].sample(instances_per_class).values.tolist()
        data += class_values
    result_dataframe = pd.DataFrame(columns=dataframe.columns, data=data)
    return result_dataframe
Now we check the function:
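For instance, with the example dataframe built above and a sample size of 3, the function should return one randomly chosen row per class:

sample = balanced_sample(df, 3, 'c')
print(sample)  # three rows, one per value of c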
And with other options:
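For example, still using the example dataframe (whose class counts in c are 2, 1 and 3):

balanced_sample(df, 4, 'c')  # warns that 4 is not a multiple of 3, then returns round(4/3) = 1 row per class
balanced_sample(df, 6, 'c')  # class 1 has a single row, so a message is printed and an empty dataframe is returned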
I hope you find it useful. If you have any doubts, comment here and I will try to answer.
Upvotes: 1
Reputation: 435
You can just get a random sample of the dataframe based on the minimum count of the target column.
column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
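As a rough check with the question's example dataframe (before df is overwritten), the class counts in c are 2, 1 and 3, so the minimum is 1 and one random row is drawn per class:

df['c'].value_counts().min()   # 1 for the example data
df.groupby('c').sample(n=1)    # one row per class, 3 rows in total

Note that groupby(...).sample(...) requires pandas 1.1 or newer.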
Upvotes: 2
Reputation: 3263
I am posting the solution that works for me. It is not the most beautiful or efficient code, but it's honest work.
import pandas as pd

df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()

k = 8  # sample size
per_class_sample_size = int(k / unique_values.shape[0])

# decide how many rows to take from each class, spreading the leftover
arr_samples_per_class = [0] * len(unique_values)
leftover = k - (per_class_sample_size * len(unique_values))
for i, v in enumerate(unique_values):
    occ = df[df[target_col] == v].shape[0]
    if leftover > 0 and occ > per_class_sample_size:
        sz = per_class_sample_size + 1
        leftover -= 1
    else:
        sz = per_class_sample_size if occ >= per_class_sample_size else occ
    arr_samples_per_class[i] = sz

# draw the per-class samples and concatenate them
fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
    ss = df.loc[df[target_col] == v].sample(sz)
    fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)
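As a rough sketch, pointing the same loop at the question's example would just mean changing the two parameters (target_col = 'c', k = 3), after which fdf holds one row per class. If the grouped ordering matters, an optional extra step (not part of the snippet above) is to shuffle the result:

fdf = fdf.sample(frac=1).reset_index(drop=True)  # optional: shuffle the concatenated sample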
Upvotes: 0
Reputation: 559
I first generated sample sizes for each unique value of column c so that the result is balanced. The remainder is distributed over the first few classes:
unique_values = df['c'].unique()
k = 3  # sample size
sample_sizes = [k // len(unique_values)] * len(unique_values)
i = 0
while i < k % len(unique_values):
    sample_sizes[i] += 1
    i += 1
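For instance, with the three classes from the question's example, k = 3 gives sample_sizes == [1, 1, 1], while k = 4 would give [2, 1, 1], the remainder going to the first class.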
This bit generates the samples based on the generated sample sizes:
df2 = pd.concat(
    [df.loc[df['c'] == unique_values[i]].sample(sample_sizes[i])
     for i in range(len(sample_sizes))]
)
Upvotes: 2
Reputation: 17152
The question is a bit ambiguous, but let's say you want to randomly select one row for each category of column c. One could do:
import pandas as pd
data = [
[1, 2, 0], [1, 4, 0], [2, 2, 1],
[4, 5, 1], [3, 7, 2], [3, 3, 2],
[1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())
a b c
c
0 1 4 0
1 2 2 1
2 3 3 2
6 1 2 6
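If you would rather keep the original row index than the group key, an optional variation (not part of the answer above) is:

sample = df.groupby('c', group_keys=False).apply(lambda x: x.sample(n=1))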
Upvotes: 0