Reputation: 2217
I have a dataframe pd
with two columns, X
and y
.
In pd[y]
I have integers from 1
to 10
inclusive. However they have different frequencies:
df[y].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there are equal number of occurrences for each label. As I want an equal number of each label, the minimum frequency is 642
. So I only want to keep 642
randomly sampled rows of each class label in my dataframe so that my new dataframe has 642
for each class label.
I thought this might have helped however stratifying only keeps the same percentage of each label but I want all my labels to have the same frequency.
As an example of a dataframe:
df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]* 6213,[8]* 5789, [7]*4643,[6]* 2532, [5]*1839,[4]* 1596,[3]* 878, [2]*815, [1]* 642],[])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Upvotes: 2
Views: 419
Reputation: 9081
df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample) # Outputs 7 in as an example
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
Upvotes: 4