Reputation: 111
I have a dataset with n independent variables and a categorical variable that I would like to perform a regression analysis on. The number of rows of data is different for each category. I would like to split the dataset into test and train data sets such that each category has an equivalent train test split, e.g. 80% to 20%. Here is a simplified reproducible example of what I'm doing.
import pandas as pd
import string
import numpy as np
from sklearn.model_selection import train_test_split
nrows=1000
# defining the category names
cat_values = ['A','B','C','D']
# assigning a random category to each row
cats = np.random.choice(cat_values, size=(nrows))
# creating a random dataframe
df = pd.DataFrame(np.random.randint(0,1000,size=(nrows, 3)), columns=['variable 1','variable 2','variable 3'])
df['category'] = cats
y = np.random.rand(nrows)
# using sklearn to split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = .2, random_state =0)
# printing the number of rows in the output training data set for each category
for i in range(len(cat_values)):
    print("number of rows in category " + str(cat_values[i]) + ": " + str(len(X_train[X_train['category']==cat_values[i]])))
Output:
number of rows in category A: 221
number of rows in category B: 188
number of rows in category C: 179
number of rows in category D: 212
I would like the rows to be split, e.g. 80:20 train:test, within each category. I've looked at using StratifiedShuffleSplit (Train/test split preserving class proportions in each split), but there doesn't seem to be an option for specifying which column to stratify the split on (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).
Is there a package that can split the data this way, or would I have to divide my dataframe into n categorical dataframes and perform a different train test split on each one before rejoining them?
Thanks for any assistance with this.
Upvotes: 1
Views: 2705
Reputation: 4021
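Use train_test_split with the stratify parameter. Pass the labels to stratify on, here the category column. Stratifying on the continuous y from the question would raise a ValueError, because every distinct value would form its own single-member class:
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=.2, random_state=0, stratify=df['category']
)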
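To check that each category ends up with roughly the same train share, here is a minimal sketch. It assumes the random data setup from the question and stratification on the category column; the seeded generator `rng` and the names `train_counts`, `total_counts` and `train_share` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild data shaped like the question's example (seeded for repeatability).
rng = np.random.default_rng(0)
nrows = 1000
df = pd.DataFrame(
    rng.integers(0, 1000, size=(nrows, 3)),
    columns=['variable 1', 'variable 2', 'variable 3'],
)
df['category'] = rng.choice(['A', 'B', 'C', 'D'], size=nrows)
y = rng.random(nrows)

# Stratify on the category labels, not on the continuous target.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=df['category']
)

# Fraction of each category's rows that landed in the training set.
train_counts = X_train['category'].value_counts()
total_counts = df['category'].value_counts()
train_share = (train_counts / total_counts).sort_index()
print(train_share)
```

Each share should come out close to 0.8. It will not be exact when a category's row count is not a multiple of 5, because the split has to round to whole rows.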
Upvotes: 1