Reputation: 1129
I have a DataFrame with two columns, content and target.
The DataFrame has 1,000,000 rows, and this is the distribution of target:
I need the same number of rows for every target value, so I'm using sample() to keep only a fixed number of rows per target value. This is the code I'm using:
samples = 100000
dataset[dataset['target'] == 0] = dataset[dataset['target'] == 0].sample(n=samples, axis=0)
dataset[dataset['target'] == 1] = dataset[dataset['target'] == 1].sample(n=samples, axis=0)
dataset[dataset['target'] == 2] = dataset[dataset['target'] == 2].sample(n=samples, axis=0)
After running this, accessing dataset[dataset['target'] == 2], for example, returns a shape of (100000, 2), but accessing dataset still returns exactly the same dataset as before. What am I doing wrong?
Upvotes: 1
Views: 180
Reputation: 150735
Since you are sampling fewer rows than each target group contains, you can use groupby().sample:
# random_state for repeatability. Remove if needed
dataset = dataset.groupby('target').sample(n=samples, random_state=42)
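Note that the result has to be assigned back to a name: assigning into a boolean mask, as in the question, only overwrites values in place and cannot remove rows, so the DataFrame keeps its original length. As a minimal sketch of the groupby approach, here it is on a small made-up DataFrame (toy, per_class and balanced are hypothetical names used only for illustration). It assumes pandas 1.1 or later, where groupby().sample is available.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 1,000,000-row DataFrame.
toy = pd.DataFrame({
    "content": [f"text {i}" for i in range(12)],
    "target": [0] * 6 + [1] * 4 + [2] * 2,  # deliberately unbalanced
})

# Rows to keep per target value. Without replace=True this must not
# exceed the size of the smallest group (here 2).
per_class = 2

# sample() returns a new DataFrame, so the result is bound to a new name;
# toy itself is left unchanged.
balanced = toy.groupby("target").sample(n=per_class, random_state=42)

# Each target value should now appear exactly per_class times.
print(balanced["target"].value_counts().sort_index())
print(len(toy), len(balanced))
```

The sampled rows keep their original index labels. If a fresh 0..n-1 index is wanted, calling reset_index(drop=True) on the result should do it.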
Upvotes: 2