Hoppity81

Reputation: 111

How can I split a dataframe using sklearn train test split such that there are equal proportions for each category?

I have a dataset with n independent variables and a categorical variable that I would like to perform a regression analysis on. The number of rows differs between categories. I would like to split the dataset into train and test sets such that every category gets the same train/test ratio, e.g. 80% to 20%. Here is a simplified reproducible example of what I'm doing.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

nrows = 1000

# defining the category names
cat_values = ['A', 'B', 'C', 'D']
# randomly assigning a category to each row
cats = np.random.choice(cat_values, size=nrows)

# creating a random dataframe
df = pd.DataFrame(np.random.randint(0,1000,size=(nrows, 3)), columns=['variable 1','variable 2','variable 3'])
df['category'] = cats

y = np.random.rand(nrows)

# using sklearn to split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=0)

# printing the number of rows in the output training data set for each category
for cat in cat_values:
    n_train = (X_train['category'] == cat).sum()
    print(f"number of rows in category {cat}: {n_train}")

Output:

number of rows in category A: 221
number of rows in category B: 188
number of rows in category C: 179
number of rows in category D: 212

I would like the rows to be split, e.g. 80:20 train:test, within each category. I've looked at using StratifiedShuffleSplit (see "Train/test split preserving class proportions in each split"), but there doesn't seem to be an option for specifying which column to stratify the split on (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).

Is there a package that can split the data this way, or would I have to divide my dataframe into n categorical dataframes and perform a different train test split on each one before rejoining them?

Thanks for any assistance with this.

Upvotes: 1

Views: 2705

Answers (1)

jcaliz

Reputation: 4021

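Use train_test_split with the stratify parameter, passing the column you want to stratify on. Here that is df['category'] rather than y: y is continuous, so nearly every value would be its own class and the split would raise an error.

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=.2, random_state=0, stratify=df['category']
)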
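
To check that the split behaves as intended, here is a minimal sketch that rebuilds the question's random data and stratifies on the category column (df["category"]). The seeded generator rng and the train_share check are my own additions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so the example is reproducible (an assumption, not in the question).
rng = np.random.default_rng(0)
nrows = 1000

# Same shape of data as in the question: three numeric columns plus a category.
df = pd.DataFrame(
    rng.integers(0, 1000, size=(nrows, 3)),
    columns=["variable 1", "variable 2", "variable 3"],
)
df["category"] = rng.choice(["A", "B", "C", "D"], size=nrows)
y = rng.random(nrows)

# Stratify on the category labels so each category is split 80:20.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=df["category"]
)

# Fraction of each category's rows that ended up in the training set.
train_share = X_train["category"].value_counts() / df["category"].value_counts()
print(train_share.round(3))
```

Each category's share should come out close to 0.8. It will not always be exactly 0.8, because the per-category row counts are not necessarily divisible by 5.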

Upvotes: 1
