Reputation: 4139
In H2O Flow, is there a way to ensure that my data frame splits have a controlled ratio of response classes?
For example, say I plan to train a binary classifier on a data frame X where 0_class_ratio% of the samples are in class 0 and 1_class_ratio% are in class 1. I want to split X into X_train and X_test with ratios 0.75 and 0.25, respectively. How can I ensure that X_train and X_test each contain 0_class_ratio% of samples in class 0 and 1_class_ratio% of samples in class 1?
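To make the requirement concrete, here is a minimal pure-Python sketch of the kind of stratification being asked about. It is not H2O code: the helper name `stratified_indices` and the 80/20 class mix are made up purely for illustration.

```python
import random
from collections import defaultdict


def stratified_indices(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly its original share."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        # Shuffle within the class, then cut the same fraction from every class.
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical data: 80 rows of class 0 and 20 rows of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_indices(labels, test_frac=0.25, seed=42)

# Share of class 1 in each split; both should match the share in the full data.
train_ratio = sum(labels[i] for i in train_idx) / len(train_idx)
test_ratio = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because the cut is made per class rather than over the whole frame, both splits end up with the same class-1 share as the full data, which is the property I want from the H2O split.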
In Python's scikit-learn package I would do something like:
from sklearn.model_selection import StratifiedShuffleSplit
# test_size=0.25 gives the 0.75/0.25 split described above
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=rng_seed_)
# split() yields positional row indices, stratified by the response_class values
for train_index, test_index in split.split(my_data, my_data["response_class"]):
    strat_train_set = my_data.iloc[train_index]
    strat_test_set = my_data.iloc[test_index]
I am aware of the h2o hyper-parameters sample_rate and sample_rate_per_class, but I'm not fully sure how to use them in this situation.
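For comparison, the h2o Python client (as opposed to the Flow UI) has an `H2OFrame.stratified_split` method that labels each row as "train" or "test" while preserving class proportions. The sketch below assumes a cluster is already running via `h2o.init()`, that `frame` is an `H2OFrame`, and that the response column is categorical. The wrapper name `h2o_stratified_frames` and its defaults are hypothetical, and whether Flow exposes an equivalent is exactly what I am unsure about.

```python
def h2o_stratified_frames(frame, response="response_class", test_frac=0.25, seed=42):
    """Split an H2OFrame into (train, test) with matching class proportions.

    Assumes h2o.init() has already connected to a cluster and that
    `response` names a categorical (factor) column of `frame`.
    """
    # stratified_split returns a one-column frame whose values are "train"/"test".
    assignment = frame[response].stratified_split(test_frac=test_frac, seed=seed)

    # Use the assignment column as a row mask to build the two splits.
    train = frame[assignment == "train"]
    test = frame[assignment == "test"]
    return train, test
```

Usage would look something like `X_train, X_test = h2o_stratified_frames(X, response="response_class")`, with `X` already loaded into the cluster.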
Upvotes: 0
Views: 1222