tk78

Reputation: 957

train_test_split with stratify

I am trying to split my dataframe (~188k rows) into a train and a test sample. The column 'FLAG' is my target variable and contains a value of either 0 or 1.

Since only about 1300 rows have 'FLAG' equal to 1, I want to do a stratified split to ensure there is a representative number of 1 values in both samples.

I tried to split using sklearn's train_test_split function:

train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])

My problem is that the resulting train and test samples have 177942 and 52 rows, respectively. I would have expected something like 150400 and 37600 rows.

My understanding from reading the documentation (sklearn.model_selection.train_test_split) is that I have to provide my dataframe, the test_size and the column containing the target classes (i.e. 'FLAG' in my case).

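Even a generic example shows the same behaviour:

# Synthetic data: 100,000 rows, with up to 1,000 of them flagged as class 1
df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
# Stratified 80/20 split on the class column
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)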

Returns: (93105, 3) (38, 3)
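
For comparison, the sizes one would expect from an 80/20 split can be worked out by hand. The sketch below is only an illustration: `expected_split_sizes` is a hypothetical helper, not part of sklearn, and it assumes the test size is rounded up with the remainder going to the train set. The figure 188,000 is a round stand-in for the "~188k" rows mentioned above.

```python
import math


def expected_split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set size is rounded up and the train set
    receives all remaining rows, so no rows are dropped.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 188_000 approximates the real dataframe; 100_000 is the generic example.
for n in (188_000, 100_000):
    n_train, n_test = expected_split_sizes(n, 0.2)
    print(f"{n} rows -> train: {n_train}, test: {n_test}")
```

Whatever the exact rounding, the two parts should add up to the full row count. The reported results do not: 177942 + 52 = 177994 for the real data and 93105 + 38 = 93143 for the generic example, so rows are being lost rather than merely distributed unevenly.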

My list of imports:

import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

Python version: 3.7.0
Sklearn version: 0.20.3
Pandas version: 0.23.4

Upvotes: 3

Views: 1901

Answers (1)

tk78

Reputation: 957

My investigation showed that the issue is caused by an integer overflow. It only occurs on 32-bit builds of Python 3.7.x; the 64-bit version works fine.

In the end I switched to 64-bit Python to resolve the issue (I previously had to use the 32-bit version because of an unrelated Oracle package dependency).
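
To check which kind of build is running, something like the following minimal sketch can help. It uses only the standard library; `interpreter_bits` is a hypothetical helper name chosen for illustration. It does not reproduce the overflow itself, it only reports whether the interpreter is a 32-bit or 64-bit build.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print(f"Python {platform.python_version()} is a {bits}-bit build")
print("sys.maxsize:", sys.maxsize)

if bits == 32:
    print("32-bit build detected: the split sizes may be affected.")
```

On a 32-bit build `sys.maxsize` is 2**31 - 1, while on a 64-bit build it is 2**63 - 1, so either value is enough to tell the two apart.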

Upvotes: 1
