Split data into train, test, validation with stratifying using Numpy

Question

I've just seen this answer on SO which shows how to split data using numpy.

Assuming we split the data 0.8, 0.1, 0.1 for training, testing, and validation respectively, you do it this way:

train, test, val = np.split(df, [int(.8 * len(df)), int(.9 * len(df))])
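
For reference, here is a minimal sketch of how the two cut points behave. It uses a plain NumPy array of 10 hypothetical values as a stand-in for df. The list passed to np.split holds the positions where the data is cut, so cuts at 8 and 9 give pieces of length 8, 1, and 1:

```python
import numpy as np

# Hypothetical stand-in for df: 10 rows
data = np.arange(10)
n = len(data)

# Cut at 80% and 90% of the length -> three consecutive pieces
train, test, val = np.split(data, [int(.8 * n), int(.9 * n)])

sizes = (len(train), len(test), len(val))
print(sizes)  # (8, 1, 1)
```

Note that the pieces are taken in row order, so nothing here shuffles the data or looks at class labels.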

I'm interested to know how I could stratify the data while splitting it with this approach.

Stratifying means splitting the data while preserving the prior of each class. That is, if you take 0.8 of the data for the training set, you take 0.8 of each class. The same holds for the test and validation sets.
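
As a worked example of that arithmetic, here is a small sketch with hypothetical class counts (100 rows of class a and 50 rows of class b, neither taken from the question) and the same 0.8 and 0.9 cut points:

```python
# Hypothetical class counts, chosen only for illustration
class_counts = {"a": 100, "b": 50}

per_class = {}
for label, n in class_counts.items():
    n_train = int(.8 * n)            # first 80% of the class
    n_test = int(.9 * n) - n_train   # next 10% of the class
    n_val = n - int(.9 * n)          # remaining rows of the class
    per_class[label] = (n_train, n_test, n_val)

print(per_class)  # {'a': (80, 10, 10), 'b': (40, 5, 5)}
```

Each set then keeps the original 2:1 ratio between the two classes.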

I tried grouping the data first by class using:

grouped_df = df.groupby(class_col_name, group_keys=False)

But it did not give the correct results.

Note: I'm familiar with train_test_split

Parfait · Accepted Answer

Simply use your groupby object, grouped_df: iterating over it yields one subset data frame per class, and you can run the needed np.split on each. Then concatenate the per-class pieces with pd.concat. Altogether, this stratifies the split as described in your question:

import numpy as np
import pandas as pd

train_list, test_list, val_list = [], [], []
grouped_df = df.groupby(class_col_name)

# ITERATE THROUGH EACH SUBSET DF
for i, g in grouped_df:
    # STRATIFY THE g (CLASS) DATA FRAME
    train, test, val = np.split(g, [int(.8 * len(g)), int(.9 * len(g))])

    train_list.append(train); test_list.append(test); val_list.append(val)

final_train = pd.concat(train_list)
final_test = pd.concat(test_list)
final_val = pd.concat(val_list)

Alternatively, a short-hand version using list comprehensions:

# ONE [train, test, val] LIST OF DATA FRAMES PER CLASS
arr_list = [np.split(g, [int(.8 * len(g)), int(.9 * len(g))]) for i, g in grouped_df]

final_train = pd.concat([t[0] for t in arr_list])
final_test = pd.concat([t[1] for t in arr_list])
final_val = pd.concat([t[2] for t in arr_list])
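
To check the result end to end, here is a self-contained sketch under a few assumptions. The data frame, its column names, and the stratified_split helper are hypothetical and not part of the question. It also swaps np.split for plain .iloc slicing, since recent pandas versions may emit a deprecation warning when np.split is called on a data frame, and it shuffles each class first so the split does not depend on row order:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows of class "a", 50 rows of class "b"
df = pd.DataFrame({
    "feature": np.arange(150),
    "label": ["a"] * 100 + ["b"] * 50,
})

def stratified_split(frame, class_col, cuts=(.8, .9), seed=0):
    """Return (train, test, val), cutting each class at the given fractions."""
    parts = ([], [], [])
    for _, g in frame.groupby(class_col):
        # Shuffle within the class so the split does not depend on row order
        g = g.sample(frac=1, random_state=seed)
        a, b = int(cuts[0] * len(g)), int(cuts[1] * len(g))
        # Slice the shuffled class into three consecutive pieces
        for bucket, piece in zip(parts, (g.iloc[:a], g.iloc[a:b], g.iloc[b:])):
            bucket.append(piece)
    return tuple(pd.concat(p) for p in parts)

train, test, val = stratified_split(df, "label")
for name, part in [("train", train), ("test", test), ("val", val)]:
    print(name, part["label"].value_counts().to_dict())
```

With these counts, train holds 80 rows of a and 40 of b, while test and val each hold 10 of a and 5 of b, so every set keeps the 2:1 class ratio of the full data.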
