Stack

Reputation: 1129

Get only a sample of a DataFrame with Pandas

I have a DataFrame with two columns, content and target. The DataFrame is 1,000,000 rows long, and this is the distribution for the target:

I need the same number of rows for each target value, so I'm using sample() to keep only a fixed number of rows per target. This is the code I'm using:

samples = 100000
dataset[dataset['target'] == 0] = dataset[dataset['target'] == 0].sample(n=samples, axis=0)
dataset[dataset['target'] == 1] = dataset[dataset['target'] == 1].sample(n=samples, axis=0)
dataset[dataset['target'] == 2] = dataset[dataset['target'] == 2].sample(n=samples, axis=0)

If I access dataset[dataset['target'] == 2] after running this, for example, it has a shape of (100000, 2), but dataset itself still looks the same as before. What am I doing wrong?

Upvotes: 1

Views: 180

Answers (1)

Quang Hoang

Reputation: 150735

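Assigning through a boolean mask only writes values into rows that already exist; it never drops rows, so dataset keeps its original length. Since you are taking fewer rows than each target group contains, you can use groupby().sample (available in pandas 1.1 and later):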

# random_state for repeatability. Remove if needed
dataset = dataset.groupby('target').sample(n=samples, random_state=42)
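
To see the effect on something small, here is a minimal sketch of the same call. The toy DataFrame, its class sizes (500/300/200) and the name balanced are made up for illustration and are not from the question. The pd.concat variant at the end is one possible fallback if your pandas version predates groupby().sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 1,000,000-row DataFrame:
# 500 rows of target 0, 300 of target 1, 200 of target 2.
dataset = pd.DataFrame({
    "content": [f"text {i}" for i in range(1000)],
    "target": np.repeat([0, 1, 2], [500, 300, 200]),
})

samples = 100  # must not exceed the size of the smallest group

# Draw `samples` rows from every target group (pandas >= 1.1).
balanced = dataset.groupby("target").sample(n=samples, random_state=42)

# Fallback sketch for older pandas: sample each group and concatenate.
balanced_concat = pd.concat(
    [group.sample(n=samples, random_state=42)
     for _, group in dataset.groupby("target")]
)

print(balanced["target"].value_counts())  # every target has 100 rows
print(balanced.shape)                     # (300, 2)
```

Both results keep the original index labels of the sampled rows; call reset_index(drop=True) if you want a fresh 0..n-1 index.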

Upvotes: 2
