Reputation: 1204

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values of its target column, which is binary (0/1).

I want to extract an equal number of rows with 0s and 1s in the "target" column. I was thinking of using the pandas sampling function, but I am not sure how to specify an equal number of samples from each class based on the target column.

I was thinking of using something like this:

df.sample(n=10000, weights='target', random_state=1)

Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!

Upvotes: 20

Answers (4)

Vaishali

Reputation: 38425

You can group the data by target and then sample from each group:

import numpy as np
import pandas as pd
df = pd.DataFrame({'col': np.random.randn(12000), 'target': np.random.randint(low=0, high=2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop=True)

new_df.target.value_counts()

1    5000
0    5000

Edit: Use DataFrameGroupBy.sample

In pandas 1.1 and later, you get similar results more directly with DataFrameGroupBy.sample:

new_df = df.groupby('target').sample(n=5000)
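
If one class has fewer than 5000 rows, sample(n=5000) raises an error unless replace=True is passed. A minimal sketch, assuming you want the largest balanced sample the data allows, caps n at the size of the smaller class. The toy data and the names n_per_class and balanced are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative imbalanced data: roughly 20% ones
df = pd.DataFrame({
    "col": rng.standard_normal(12000),
    "target": (rng.random(12000) < 0.2).astype(int),
})

# Largest n that every class can supply without replacement
n_per_class = int(df["target"].value_counts().min())

balanced = df.groupby("target").sample(n=n_per_class, random_state=1)
print(balanced["target"].value_counts())
```

Both classes end up with n_per_class rows, so the result has twice that many rows in total.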

Upvotes: 35

Ahmad

Reputation: 9678

You can use the DataFrameGroupBy.sample method as follows:

sample_df = df.groupby("target").sample(n=5000, random_state=1)
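
A grouped sample typically comes back with the rows of each class sitting together, which can matter if the data is later consumed in order. A short sketch, assuming a dataframe with a binary target column as in the question, shuffles the balanced sample afterwards. The toy data and the name shuffled are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col": rng.standard_normal(12000),
    "target": rng.integers(0, 2, size=12000),
})

sample_df = df.groupby("target").sample(n=5000, random_state=1)

# frac=1 returns every row in random order; reset_index discards the old labels
shuffled = sample_df.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled["target"].value_counts())
```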

Upvotes: 13

mlenthusiast

Reputation: 1204

Also found this to be a good method:

df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')

Set frac to the fraction of the original dataframe you want returned (for example, frac=.1 returns 10% of the rows).
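
One caveat: giving every row the same weight (.5 for both classes) is equivalent to plain uniform sampling, so the class proportions in the sample will mirror those of the original dataframe. A sketch of one alternative, assuming the goal is a roughly balanced sample, weights each row by the inverse of its class size. The balance is only approximate, not exact, and the toy data and the class_size helper are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative imbalanced data: roughly 20% ones
df = pd.DataFrame({
    "col": rng.standard_normal(20000),
    "target": (rng.random(20000) < 0.2).astype(int),
})

# Each row's weight is 1 / (size of its class), so both classes carry equal total weight
class_size = df.groupby("target")["target"].transform("count")
df["weights"] = 1.0 / class_size

sample_df = df.sample(frac=0.1, random_state=111, weights="weights")
print(sample_df["target"].value_counts(normalize=True))
```

If an exact 50/50 split is required, sampling per group is the more direct route.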

Upvotes: 7

Beauregard D

Reputation: 107

You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine the two samples into a dfsample dataframe. You can create df0 and df1 by filtering df on the target column. If you provide sample data I can help you construct that logic.
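
The split, sample and combine steps described above can be sketched with boolean indexing and pd.concat. This is a minimal illustration on toy data: the names df0, df1 and dfsample follow the answer, and everything else is assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col": rng.standard_normal(12000),
    "target": rng.integers(0, 2, size=12000),
})

# Split by class with boolean masks
df0 = df[df["target"] == 0]
df1 = df[df["target"] == 1]

# Sample each class separately, then stack the two samples
dfsample = pd.concat([
    df0.sample(n=5000, random_state=1),
    df1.sample(n=5000, random_state=1),
])
print(dfsample["target"].value_counts())
```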

Upvotes: 1
