Stratified sampling on response categories in h2o flow data splits

Question

In H2O Flow, is there a way to ensure that my data frame splits have a controlled ratio of response classes?

For example, say I plan to train a binary classifier on a data frame X where 0_class_ratio% of the samples are in class 0 and 1_class_ratio% are in class 1. I want to split X into X_train and X_test with ratios 0.75 and 0.25, respectively. How can I ensure that both X_train and X_test contain 0_class_ratio% of samples in class 0 and 1_class_ratio% of samples in class 1?

In Python's scikit-learn package I would do something like:

from sklearn.model_selection import StratifiedShuffleSplit

# test_size=0.25 gives the 0.75 / 0.25 split described above; rng_seed_ is any fixed integer seed
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=rng_seed_)
# iterate over the shuffled index arrays of my_data, stratified by response_class values
for train_index, test_index in split.split(my_data, my_data["response_class"]):
    # split() yields positional indices, so select rows with .iloc rather than .loc
    strat_train_set = my_data.iloc[train_index]
    strat_test_set = my_data.iloc[test_index]

I am aware of the H2O hyperparameters sample_rate and sample_rate_per_class, but I'm not sure how to use them in this situation.

Erin LeDell · Accepted Answer

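Set fold_assignment to "Stratified" in the model-building parameters. The option takes effect when cross-validation is enabled (nfolds set to 2 or more) and no fold_column is given; it stratifies the cross-validation folds on the response column.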
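For anyone scripting this instead of clicking through Flow, here is a rough sketch with the H2O Python client. It is not taken from the question: the helper name split_and_train, the file path, the GBM estimator, and the values nfolds=5 and seed=42 are all assumptions for illustration. It uses stratified_split on the response column for the 0.75 / 0.25 train/test split and fold_assignment="Stratified" for the cross-validation folds.

```python
# Parameters for stratified cross-validation folds (values are illustrative).
CV_PARAMS = {"nfolds": 5, "fold_assignment": "Stratified", "seed": 42}


def split_and_train(path, response="response_class", test_frac=0.25, seed=42):
    """Stratified train/test split plus stratified CV folds (hypothetical helper)."""
    # Imported here so the snippet loads even where h2o is not installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # starts or connects to a local H2O cluster
    frame = h2o.import_file(path)
    frame[response] = frame[response].asfactor()  # treat the response as categorical

    # stratified_split returns one column labelled "train" / "test",
    # drawn so that each class keeps roughly its original share.
    assignment = frame[response].stratified_split(test_frac=test_frac, seed=seed)
    train = frame[assignment == "train"]
    test = frame[assignment == "test"]

    # Cross-validation folds inside the training frame are stratified as well.
    model = H2OGradientBoostingEstimator(**CV_PARAMS)
    model.train(y=response, training_frame=train)
    return model, train, test
```

Calling model, train, test = split_and_train("my_data.csv") (the path is a placeholder) and then comparing train["response_class"].table() with test["response_class"].table() should show roughly the same class proportions in both frames.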
