armin

Reputation: 651

Random selection of value in a dataframe with multiple conditions

Let's say I have a dataframe like this.

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.choice(list(['a', 'b', 'c', 'd']), 50), columns=list('1'))
print(df.value_counts())
1
d    18
a    12
b    12
c     8
dtype: int64

Now I am trying to sample based on the frequency of each value in the column. For example, if the count of a value is 8 or below (here, value c), select 50% of its rows; if it is above 8 and up to 12, select 40%; and if it is above 12, select 30%.

Here is what I thought might be a way to do it, but it does not produce exactly what I am looking for.

sample_df = df.groupby('1').apply(lambda x: x.sample(frac=.2)).reset_index(drop=True)
print(sample_df.value_counts())
1
d    4
a    2
b    2
c    2

Upvotes: 0

Views: 116

Answers (2)

Shirin Yavari

Reputation: 682

Let's start by writing the weight function you have in mind:

def weigth_fun(freq):
    # Map a value's frequency to its sampling weight.
    if freq <= 8:
        w = 0.5
    elif freq <= 12:
        w = 0.4
    elif freq > 12:
        w = 0.3
    else:
        # Only reached for invalid input such as NaN.
        raise ValueError('wrong frequency value!')
    return w

weigth_fun takes the frequency (freq) of the value in each row of your dataframe and returns the weight (w) to assign to it. The frequencies themselves come from the line below, which we will later use inside apply:

(df.groupby('1')['1'].transform('count')).head(6)

which outputs the frequencies below:

[image: output of the line above]

Now that we know what the line above does, we enter it into the apply method:

sample=df.groupby('1', group_keys=False).apply(lambda x: x.sample(weights=(df.groupby('1')['1'].transform('count')).apply(weigth_fun)))
print(sample)

which results in:

[image: the resulting sample]
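Because x.sample(...) is called above without n or frac, it draws a single row per group, and the weights only influence which row is picked. If the goal is instead to keep a size-dependent share of each group, one option is to pass a different frac to each group. The following is a minimal sketch under that assumption, using the thresholds from the question; the helper name frac_for and the random_state value are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(['a', 'b', 'c', 'd'], 50), columns=['1'])


def frac_for(count):
    """Hypothetical helper: map a group's size to a sampling fraction."""
    if count <= 8:
        return 0.5
    if count <= 12:
        return 0.4
    return 0.3


# Sample each group separately, with a fraction that depends on its size.
parts = [
    group.sample(frac=frac_for(len(group)), random_state=0)
    for _, group in df.groupby('1')
]
sample_df = pd.concat(parts)

print(sample_df['1'].value_counts())
```

sample rounds frac * group size to the nearest integer, so with the counts shown in the question (18, 12, 12 and 8) this should select about 5, 5, 5 and 4 rows respectively.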

Upvotes: 1

sconfluentus

Reputation: 4993

If you are trying to take a proportional subset, where the share of each category is preserved, you can choose a percentage, say 25% of the whole set, and sample it with this code:

sample=df.groupby('column_category', group_keys=False).apply(lambda x: x.sample(frac=0.25))

This takes 25% of the whole set while maintaining the proportions of the column column_category.
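As a rough illustration on the dataframe from the question, where the category column is named '1', the sketch below compares category shares before and after sampling. It uses DataFrameGroupBy.sample, which recent pandas versions provide, in place of apply; that substitution and the random_state value are assumptions made here for brevity and reproducibility.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(['a', 'b', 'c', 'd'], 50), columns=['1'])

# Draw 25% of the rows within each category of column '1'.
sample = df.groupby('1').sample(frac=0.25, random_state=0)

# Compare category shares before and after sampling.
print(df['1'].value_counts(normalize=True))
print(sample['1'].value_counts(normalize=True))
```

Because each group's sample size is rounded to a whole number of rows, the shares in the sample match the original only approximately on a dataframe this small.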

Upvotes: 0
