Reputation: 651
Let's say I have a dataframe like this.
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.choice(list(['a', 'b', 'c', 'd']), 50), columns=list('1'))
print(df.value_counts())
1
d 18
a 12
b 12
c 8
dtype: int64
Now, I am trying to sample rows based on the frequency of each value in the column. For example, if the count of a value is 8 or lower (here, value c), select 50% of the rows; if it is above 8 and up to 12, select 40%; and if it is above 12, select 30%.
Here is what I thought might be a way to do it, but it does not produce exactly what I am looking for.
sample_df = df.groupby('1').apply(lambda x: x.sample(frac=.2)).reset_index(drop=True)
print(sample_df.value_counts())
1
d 4
a 2
b 2
c 2
Upvotes: 0
Views: 116
Reputation: 682
Let's start by creating the weight function that you have in mind as:
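def weigth_fun(freq):
    # Map the frequency of a value to a sampling weight.
    if freq <= 8:
        w = 0.5
    elif freq <= 12:
        w = 0.4
    elif freq > 12:
        w = 0.3
    else:
        # Reached only for input that cannot be compared, such as NaN.
        raise ValueError('wrong frequency value!')
    return w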
weigth_fun takes the frequency (freq) of the value in a row of your dataframe and returns the weight (w) assigned to it. Next, compute the frequencies with the line below, which we will later use inside the apply:
(df.groupby('1')['1'].transform('count')).head(6)
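This returns, for the first six rows, how often each row's value occurs in the column. These are the frequencies that weigth_fun turns into weights.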
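To make the behaviour of transform('count') concrete, here is a small sketch on a hand-built frame. The toy data and the names toy and freq are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical toy data: 'a' appears three times, 'b' twice, 'c' once.
toy = pd.DataFrame({'1': ['a', 'b', 'a', 'c', 'b', 'a']})

# transform('count') returns a Series aligned with the original rows;
# each entry is the size of the group that the row belongs to.
freq = toy.groupby('1')['1'].transform('count')
print(freq.tolist())  # [3, 2, 3, 1, 2, 3]
```

Because the result keeps the original index, it can be passed through apply(weigth_fun) row by row.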
Now that we know what the line above does, we enter it into the apply method:
sample=df.groupby('1', group_keys=False).apply(lambda x: x.sample(weights=(df.groupby('1')['1'].transform('count')).apply(weigth_fun)))
print(sample)
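Because neither n nor frac is passed, sample draws a single row from each group, so the result contains one row per group.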
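For the per-group percentages described in the question, one alternative is to choose frac from the size of each group instead of passing weights. The following is only a sketch: it assumes the thresholds from the question (a count of 8 or lower gives 50%, above 8 and up to 12 gives 40%, above 12 gives 30%), and frac_for_count and sample_by_frequency are hypothetical helper names, not pandas functions.

```python
import numpy as np
import pandas as pd


def frac_for_count(count):
    """Map a group's size to a sampling fraction (thresholds assumed from the question)."""
    if count <= 8:
        return 0.5
    if count <= 12:
        return 0.4
    return 0.3


def sample_by_frequency(df, column, random_state=None):
    """Sample each group of `column` with a fraction chosen from the group's size."""
    parts = [
        group.sample(frac=frac_for_count(len(group)), random_state=random_state)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts)


# Same construction as in the question.
np.random.seed(0)
df = pd.DataFrame(np.random.choice(['a', 'b', 'c', 'd'], 50), columns=['1'])

sample = sample_by_frequency(df, '1', random_state=0)
print(sample['1'].value_counts())
```

pandas rounds frac times the group size to the nearest integer, so a group of 12 rows sampled at 40% yields 5 rows rather than exactly 4.8.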
Upvotes: 1
Reputation: 4993
If you are trying to take a proportional subset, where the share of each category is preserved, you can sample a fixed percentage of the whole set, say 25%, with this code:
sample=df.groupby('column_category', group_keys=False).apply(lambda x: x.sample(frac=0.25))
This takes 25% of the whole set while maintaining the proportions of the column column_category.
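As a quick check of the proportional idea, here is a sketch on made-up data. It uses DataFrameGroupBy.sample (available in pandas 1.1 and later) instead of apply with a lambda; the frame data and its category labels are hypothetical.

```python
import pandas as pd

# Hypothetical data: 40 rows of 'x', 20 of 'y', 20 of 'z'.
data = pd.DataFrame({'column_category': ['x'] * 40 + ['y'] * 20 + ['z'] * 20})

# Draw 25% of the rows from each category separately.
subset = data.groupby('column_category').sample(frac=0.25, random_state=0)

# Each category keeps its share: 10 'x', 5 'y' and 5 'z' rows.
counts = subset['column_category'].value_counts().to_dict()
print(counts)
```

The 2:1:1 ratio of the original categories is preserved in the subset, which is what sampling the same fraction from every group guarantees up to rounding.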
Upvotes: 0