Reputation: 1964
In the probability calibration of classifiers example from scikit-learn, there is a section of code using train_test_split that I could not find explained in the documentation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

n_samples = 50000  # defined earlier in the full example
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
    train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)
What does sample_weight in train_test_split do?
How does the source code of train_test_split process sample_weight?
Thanks a lot in advance.
Upvotes: 2
Views: 4777
Reputation: 500923
train_test_split doesn't just take X and y. It can take an arbitrary sequence of arrays that have the same first dimension and split them randomly, but consistently, into two sets along that dimension.
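For instance (a toy sketch, not the data from your example), passing three equal-length arrays returns six arrays, and the same random row selection is applied to all of them:

import numpy as np
from sklearn.model_selection import train_test_split

# Three arrays sharing the same number of rows (first dimension = 10).
X = np.arange(20).reshape(10, 2)   # features, shape (10, 2)
y = np.arange(10)                  # labels, shape (10,)
w = np.linspace(0.1, 1.0, 10)      # per-observation weights, shape (10,)

# One call splits all three consistently, so rows stay aligned across outputs.
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, w, test_size=0.3, random_state=0)

print(X_tr.shape, y_tr.shape, w_tr.shape)  # (7, 2) (7,) (7,)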
In your example there's an array of random weights (one weight per observation) that gets split into training and test arrays, sw_train and sw_test, alongside X and y.
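Those weights are then typically passed on to the estimator. Continuing from the snippet in your question, the classifier below is my own choice for illustration (the original example may use a different one), but any estimator whose fit method accepts sample_weight works the same way:

from sklearn.linear_model import LogisticRegression

# Fit on the training split, weighting each training observation by sw_train.
clf = LogisticRegression()
clf.fit(X_train, y_train, sample_weight=sw_train)

# The test-set weights can likewise be used for a weighted evaluation.
print(clf.score(X_test, y_test, sample_weight=sw_test))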
There are many reasons to assign weights to observations.
Upvotes: 3