Shiva

Reputation: 73

How to select a subset of pandas dataframe containing an even distribution of one column's values?

I have a huge dataset spanning many years. For local tests, I need to extract a small dataframe that contains only a few samples from each year, so that the years are evenly represented. Does anyone know how to do that?

After grouping by the 'year' column, the number of rows in each year looks something like this:

| year| A |
| ----| --|
| 1838| 1000|
| 1839| 2600|
| 1840| 8900|
| 1841| 9900|

I want to select a subset which, after the same groupby, looks like:

| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|

Upvotes: 1

Views: 481

Answers (1)

Juancheeto

Reputation: 586

Try groupby().sample(), which draws a random sample of rows from each group (available since pandas 1.1).

Here's an example using dummy data:

import numpy as np
import pandas as pd

# create a long array of random 'years' from 1800 to 1804 (high is exclusive)
years = np.random.randint(low=1800, high=1805, size=200)
values = np.random.randint(low=1, high=200, size=200)
df = pd.DataFrame({"Years": years, "Values": values})

# draw the same number of random rows from every year
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
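One caveat: groupby().sample(n=...) raises a ValueError if any group has fewer than n rows (unless replace=True is passed). If some years in the real data might be that sparse, a possible workaround is to shuffle the whole frame once and keep the first n rows of each group. This is a different technique from the one above, shown only as a rough sketch: the helper name sample_per_group and the dummy row counts are made up for illustration, and the column names 'year' and 'A' are taken from the question.

```python
import numpy as np
import pandas as pd


def sample_per_group(df, column, n, seed=0):
    """Return up to n randomly chosen rows from each group of `column`."""
    # Shuffle all rows once, then keep the first n rows seen in each group.
    # Groups with fewer than n rows are kept whole instead of raising.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(column).head(n).sort_values(column)


# Dummy data (assumed sizes): 1838 deliberately has only 5 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": np.repeat([1838, 1839, 1840, 1841], [5, 26, 89, 99]),
    "A": rng.integers(1, 200, size=219),
})

subset = sample_per_group(df, "year", n=10)

# Rows per year in the subset: 5 for 1838, 10 for every other year.
counts = subset.groupby("year")["A"].count()
print(counts)
```

If every year is guaranteed to have at least n rows, groupby().sample() as shown above is simpler and should be preferred.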

Upvotes: 1
