Caitlin

Reputation: 589

sklearn train_test_split on pandas stratify by multiple columns

I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training and test set. I would like to stratify my data by at least 2, but ideally 4 columns in my dataframe.

There were no warnings from sklearn when I tried to do this, however I found later that there were repeated rows in my final data set. I created a sample test to show this behavior:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

a = np.array([i for i in range(1000000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

It seems to work as expected if I stratify by either column:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get repeated values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 640000

Upvotes: 59

Views: 86917

Answers (5)

Samuel Nde

Reputation: 2743

Right now, you can achieve this by simply passing a list of columns to stratify on, as I do below. Assuming you have two columns, Sex_I and Sex_M, you can try something like:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=622, stratify=X[['Sex_I', 'Sex_M']])

This works because stratify is array-like as explained in the train_test_split documentation.

Upvotes: 0

grofte

Reputation: 2119

You need to iteratively split your data. There's a class for it in scikit-multilearn. Bit annoying that it only works on NumPy arrays but what can you do?

Here's a function that should do what you are asking for:

import pandas as pd
from skmultilearn.model_selection import IterativeStratification

def iterative_split(df, test_size, stratify_columns):
    """Custom iterative train test split which
    'maintains balanced representation with respect
    to order-th label combinations.'

    From https://madewithml.com/courses/mlops/splitting/#stratified-split
    """
    # One-hot encode the stratify columns and concatenate them
    one_hot_cols = [pd.get_dummies(df[col]) for col in stratify_columns]
    one_hot_cols = pd.concat(one_hot_cols, axis=1).to_numpy()
    stratifier = IterativeStratification(
        n_splits=2, order=len(stratify_columns), sample_distribution_per_fold=[test_size, 1-test_size])
    train_indices, test_indices = next(stratifier.split(df.to_numpy(), one_hot_cols))
    # Return the train and test set dataframes
    train, test = df.iloc[train_indices], df.iloc[test_indices]
    return train, test

example = pd.DataFrame({'a': [1, 2, 3]*8*2, 'b': [4, 5, 6, 7]*6*2, 'c': [7, 8]*12*2})
train, test = iterative_split(example, 0.4, ['a', 'b'])
# print(f'{train =}')
# print(f'{test =}')

print(f'{train[["a"]].value_counts() =}')
print(f'{test[["a"]].value_counts()  =}')
print(f'{train[["b"]].value_counts() =}')
print(f'{test[["b"]].value_counts()  =}')

Output

train[["a"]].value_counts() =a
1    10
2    10
3    10
dtype: int64
test[["a"]].value_counts()  =a
1    6
2    6
3    6
dtype: int64
train[["b"]].value_counts() =b
5    8
6    8
4    7
7    7
dtype: int64
test[["b"]].value_counts()  =b
4    5
7    5
5    4
6    4
dtype: int64

And for your example we can add this code:

import numpy as np

a = np.array([i for i in range(10_000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

train, test = iterative_split(df, test_size=0.2, stratify_columns=['b', 'c'])

print(len(train.a.values))  # prints 8000
print(len(set(train.a.values)))  # prints 8000

The one_hot_cols becomes a matrix of 1e6 x 3e5 in your example and that was a bit much. If someone comes up with a better way then I am all ears.

Upvotes: 3

Sesquipedalism

Reputation: 1733

If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.

df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])

If you're worried about collisions, where pairs such as (11, 3) and (1, 13) both produce the concatenated value 113, you can add an arbitrary separator string in the middle:

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
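
An alternative to string concatenation, sketched here as my own assumption rather than anything taken from the scikit-learn docs, is to let pandas number each (b, c) combination with groupby(...).ngroup() and stratify on that integer label. The small DataFrame below just mirrors the question's example at a reduced size, and the name strata is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same shape of data as the question, but smaller so it runs quickly.
a = np.arange(10_000)
df = pd.DataFrame({"a": a, "b": a % 10, "c": a % 5})

# One integer label per distinct (b, c) combination; no string building needed.
strata = df.groupby(["b", "c"]).ngroup()

# train_test_split raises a ValueError when a stratum has fewer than 2 rows,
# so check the smallest stratum before splitting.
assert strata.value_counts().min() >= 2

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=strata)

# Both numbers should match if no row was sampled twice.
print(len(train), train["a"].nunique())
```

This sidesteps the separator question and should extend to four columns by listing them all in the groupby. With more columns some combinations can become rare, which is why the check on the smallest stratum is there.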

Upvotes: 65

Louis T

Reputation: 671

What version of scikit-learn are you using? You can check with sklearn.__version__.

Prior to version 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. This was patched in 0.19.0.

It is described in issue #9044.

Updating scikit-learn should fix the problem. If you can't update, see this commit history for the fix.
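
A quick way to run that check from a script is sketched below. The has_fix name is made up for illustration, and the comparison only looks at the major and minor version components, taking 0.19 as the threshold named above.

```python
import sklearn

print(sklearn.__version__)

# Compare only the major and minor parts against 0.19, the release named above.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
has_fix = (major, minor) >= (0, 19)
print("at or past 0.19:", has_fix)
```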

Upvotes: 12

andrew_reece

Reputation: 21274

The reason you're getting duplicates is because train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since strata are defined from two columns, one row of data may represent more than one stratum, and so sampling may choose the same row twice because it thinks it's sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
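
To see the effect in isolation, here is a small check of how np.unique treats a two-column array. The array y is made up for illustration. Without an axis argument NumPy flattens the input first, so the individual values become the classes; with axis=0 each row is treated as one unit.

```python
import numpy as np

# Four rows, two label columns, three distinct (b, c) combinations.
y = np.array([["foo", "y"],
              ["bar", "z"],
              ["foo", "y"],
              ["bar", "y"]])

flat_classes = np.unique(y)         # no axis: the array is flattened first
row_classes = np.unique(y, axis=0)  # axis=0: each row counts as one class

print(flat_classes)  # the four individual values
print(row_classes)   # the three distinct rows
```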

Here's a simplified sample case, a variation on the example you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

The stratification function thinks there are four classes to split on: foo, bar, y, and z. But since these classes are essentially nested, meaning y and z both show up in b == foo and b == bar, we'll get duplicates when the splitter tries to sample from each class.

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.

There's a larger design question here: Do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.

You might find this discussion of nested stratified sampling useful.

Upvotes: 37
