Test_train_split with stratify

Question

I am trying to split my dataframe (~188k rows) into train and test samples. The column 'FLAG' is my target variable and contains either 0 or 1.

Since only about 1300 rows have 'FLAG' equal to 1, I want to do a stratified split to ensure there is a representative number of 1 values in both samples.

I tried to split using sklearn's train_test_split function:

train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])

My problem is that the resulting train and test samples have 177942 and 52 rows, respectively. I would have expected something like 150400 and 37600 rows.
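
The expected sizes follow from simple arithmetic. A rough sketch, assuming exactly 188,000 rows and 1,300 positives (both are rounded stand-ins for the real data, and the variable names are only illustrative):

```python
# Rough expectation for a stratified 80/20 split.
# Assumed round figures: the real dataframe has ~188k rows and ~1300 positives.
n_rows = 188_000
n_positive = 1_300
test_size = 0.2

# Overall sample sizes
expected_test = round(n_rows * test_size)
expected_train = n_rows - expected_test

# Positives per sample if the class ratio is preserved
expected_test_pos = round(n_positive * test_size)
expected_train_pos = n_positive - expected_test_pos

print(expected_train, expected_test)
print(expected_train_pos, expected_test_pos)
```

Under these assumptions both samples would keep roughly the same share of 'FLAG' = 1 rows (about 0.7%), and together they would contain every row of the original dataframe.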

My understanding from reading the documentation (sklearn.model_selection.train_test_split) is that I have to provide my dataframe, the test_size and the column containing the target classes (i.e. 'FLAG' in my case).

Even a generic example shows the same problem:

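# 100,000 rows; the target column 'c' starts as all zeros
df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
# Set up to 1000 randomly chosen rows to 1 (randint may repeat an index)
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)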

Returns: (93105, 3) (38, 3)
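
Note that the two shapes above add up to 93143 rows, not 100000, so rows are missing rather than just being distributed unevenly. To check both the sizes and the class balance of a split, something like the following could be used. This is a minimal sketch: `summarize_split` is a hypothetical helper written for illustration, not part of sklearn or pandas, and the toy labels are made up.

```python
from collections import Counter

def summarize_split(train_labels, test_labels, positive=1):
    """Hypothetical helper: row count and positive-class share per sample."""
    summary = {}
    for name, labels in (("train", train_labels), ("test", test_labels)):
        labels = list(labels)
        n = len(labels)
        n_pos = Counter(labels)[positive]
        summary[name] = {
            "rows": n,
            "positives": n_pos,
            # Guard against an empty sample
            "positive_share": n_pos / n if n else 0.0,
        }
    # Should equal the length of the original dataframe
    summary["total_rows"] = summary["train"]["rows"] + summary["test"]["rows"]
    return summary

# Toy labels standing in for tr['c'] and ts['c']
print(summarize_split([0] * 792 + [1] * 8, [0] * 198 + [1] * 2))
```

For the example above it would be called as `summarize_split(tr['c'], ts['c'])`. In a correct stratified split, `total_rows` matches `len(df)` and the two `positive_share` values are nearly identical.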

My list of imports:

import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

My Python version: 3.7.0, sklearn version: 0.20.3, pandas version: 0.23.4
