Reputation: 1964
In the probability calibration of classifiers example from scikit-learn, there is a section of code using train_test_split that I could not find explained in the documentation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

n_samples = 50000  # defined earlier in the full example
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
    train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)
What does sample_weight in train_test_split do?
How does the source code of train_test_split process sample_weight?
Thanks a lot in advance.
Upvotes: 2
Views: 4777
Reputation: 500923
train_test_split doesn't just take X and y. It can take an arbitrary sequence of arrays that have the same first dimension and split them randomly, but consistently, into two sets along that dimension.
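For instance (a toy sketch, not the data from your example), passing three equal-length arrays returns six arrays, and the same random row selection is applied to all of them:

import numpy as np
from sklearn.model_selection import train_test_split

# Three arrays sharing the same number of rows (first dimension = 10).
X = np.arange(20).reshape(10, 2)   # features, shape (10, 2)
y = np.arange(10)                  # labels, shape (10,)
w = np.linspace(0.1, 1.0, 10)      # per-observation weights, shape (10,)

# One call splits all three consistently, so rows stay aligned across outputs.
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, w, test_size=0.3, random_state=0)

print(X_tr.shape, y_tr.shape, w_tr.shape)  # (7, 2) (7,) (7,)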
In your example there's an array of random weights (one weight per observation) that gets split into training and test arrays, sw_train and sw_test, alongside X and y.
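Those weights are then typically passed on to the estimator. Continuing from the snippet in your question, the classifier below is my own choice for illustration (the original example may use a different one), but any estimator whose fit method accepts sample_weight works the same way:

from sklearn.linear_model import LogisticRegression

# Fit on the training split, weighting each training observation by sw_train.
clf = LogisticRegression()
clf.fit(X_train, y_train, sample_weight=sw_train)

# The test-set weights can likewise be used for a weighted evaluation.
print(clf.score(X_test, y_test, sample_weight=sw_test))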
There are many reasons to assign weights to observations.
Upvotes: 3