Get only a sample of a DataFrame with Pandas

Question

I have a DataFrame with two columns, content and target. It has 1.000.000 rows, and the target column is distributed as follows:

Target = 0 (100.000)
Target = 1 (400.000)
Target = 2 (500.000)

I need the same number of rows for each target value, so I'm using sample() to keep only a fixed number of rows per target. This is the code I'm using:

samples = 100000
dataset[dataset['target'] == 0] = dataset[dataset['target'] == 0].sample(n=samples, axis=0)
dataset[dataset['target'] == 1] = dataset[dataset['target'] == 1].sample(n=samples, axis=0)
dataset[dataset['target'] == 2] = dataset[dataset['target'] == 2].sample(n=samples, axis=0)

After running this, accessing dataset[dataset['target'] == 2], for example, returns a shape of (100000, 2), but accessing dataset returns exactly the same dataset as before. What am I doing wrong?

Quang Hoang · Accepted Answer

Since you are sampling no more rows than each target group contains, you can sample within each group using groupby().sample:

# random_state for repeatability. Remove if needed
dataset = dataset.groupby('target').sample(n=samples, random_state=42)
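
To see the effect end to end, here is a minimal sketch on a small synthetic frame. The data, the group sizes (10/40/50), and the name balanced are made up for illustration and only stand in for the 1.000.000-row dataset from the question. It assumes pandas 1.1 or newer, where groupby().sample is available.

```python
import pandas as pd

# Hypothetical stand-in for the question's data:
# 10 rows with target 0, 40 with target 1, 50 with target 2.
dataset = pd.DataFrame({
    "content": [f"text {i}" for i in range(100)],
    "target": [0] * 10 + [1] * 40 + [2] * 50,
})

# Must not exceed the smallest group size unless replace=True is passed.
samples = 10

# Draw `samples` rows from every target group and bind the result to a name,
# rather than assigning back into a boolean mask of the original frame.
balanced = dataset.groupby("target").sample(n=samples, random_state=42)

print(balanced["target"].value_counts().sort_index())  # 10 rows per target
print(balanced.shape)  # (30, 2)
```

The key difference from the code in the question is that the sampled result replaces the whole frame. Assigning into dataset[mask] can only overwrite values in existing rows; it never drops rows, so the frame keeps its original length.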
