Tengerye

Reputation: 1964

What is sample_weight in sklearn.model_selection.train_test_split?

In the probability calibration example for classifiers in scikit-learn, there is a piece of code that calls train_test_split in a way I could not find explained in the documentation.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

n_samples = 50000  # assumed value; n_samples is defined elsewhere in the example
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
    train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)

  1. What does sample_weight in train_test_split do?

  2. How does the source code of train_test_split process sample_weight?

Thanks a lot in advance.

Upvotes: 2

Views: 4777

Answers (1)

NPE

Reputation: 500923

train_test_split doesn't just take X and y. It accepts any number of arrays that share the same first dimension and splits them randomly, but consistently, into two sets along that dimension: the same rows end up in the training set for every array you pass.
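To make the "consistent split" point concrete, here is a minimal sketch. The toy arrays X, y and w are made up for illustration (they are not from the question) and are built so that row i can be recognised in each array after the shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Three toy arrays sharing the same first dimension (10 rows).
# Row i of X is [2i, 2i + 1], y[i] = i and w[i] = i / 10.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
w = np.arange(10) / 10.0  # hypothetical per-row weights

# Each input array yields two outputs: a train part and a test part.
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.5, random_state=0
)

# The same shuffled row indices are applied to every array,
# so the relationship between X, y and w survives the split.
rows_stay_aligned = (
    np.array_equal(X_train[:, 0], 2 * y_train)
    and np.allclose(w_train, y_train / 10.0)
    and np.array_equal(X_test[:, 0], 2 * y_test)
    and np.allclose(w_test, y_test / 10.0)
)
print(rows_stay_aligned)
```

Nothing about the third array is special to train_test_split; it is treated exactly like X and y, and the name sample_weight carries no meaning for the function itself.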

In your example there's an array of random weights (one weight per observation) that gets split into training and test arrays, sw_train and sw_test.
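As a rough sketch of why you would want the weights split alongside the data: the training weights can be handed to an estimator's fit, and the test weights to a metric. The choice of GaussianNB and brier_score_loss, the two cluster centers and the sample size below are my assumptions for illustration, not something taken from your snippet.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small two-class toy problem (sizes and centers are illustrative assumptions).
X, y = make_blobs(n_samples=200, n_features=2,
                  centers=[(-5, -5), (5, 5)], random_state=42)
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# The weights are split with the same row assignment as X and y.
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, test_size=0.5, random_state=42
)

# sw_train lines up with X_train / y_train, so it can weight the fit.
clf = GaussianNB()
clf.fit(X_train, y_train, sample_weight=sw_train)

# sw_test lines up with X_test / y_test, so it can weight the evaluation.
prob_pos = clf.predict_proba(X_test)[:, 1]
score = brier_score_loss(y_test, prob_pos, sample_weight=sw_test)
print(f"weighted Brier score: {score:.4f}")
```

If the weights were not passed through train_test_split, you would have no easy way to know which weight belongs to which row after the shuffle.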

There are many reasons to assign weights to observations. For further discussion, see:

Upvotes: 3
