Ziqi

Reputation: 2544

scikit learn: train_test_split, can I ensure same splits on different datasets

I understand that the train_test_split method splits a dataset into random train and test subsets, and that setting random_state to a fixed integer ensures we get the same split on a given dataset each time the method is called.

My problem is slightly different.

I have two datasets, A and B. They contain identical sets of examples, and the order in which these examples appear in each dataset is also identical. But the key difference is that the examples in each dataset use a different set of features.

I would like to test whether the features used in A lead to better performance than the features used in B. So I would like to ensure that when I call train_test_split on A and B, I get the same splits on both datasets, so that the comparison is meaningful.

Is this possible? Do I simply need to ensure that the random_state in both method calls is the same?

Thanks

Upvotes: 14

Views: 18487

Answers (4)

torayeff

Reputation: 9702

Since sklearn.model_selection.train_test_split(*arrays, **options) accepts a variable number of arrays, you can pass both datasets (and the labels y) in a single call; every array is split with the same shuffled indices:

A_train, A_test, B_train, B_test, _, _ =  train_test_split(A, B, y, 
                                                           test_size=0.33,
                                                           random_state=42)
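For instance, here is a minimal self-contained sketch (A, B, and y below are made-up toy data) showing that a single call keeps the rows of both datasets aligned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: A and B describe the same 6 examples with different features.
A = np.arange(12).reshape(6, 2)   # feature set A (2 features)
B = np.arange(18).reshape(6, 3)   # feature set B (3 features)
y = np.array([0, 1, 0, 1, 0, 1])  # labels

# A single call shuffles all arrays with the same indices.
A_train, A_test, B_train, B_test, y_train, y_test = train_test_split(
    A, B, y, test_size=0.33, random_state=42)

# Row i of A_train and row i of B_train are the same underlying example.
print(A_train.shape, B_train.shape)  # (4, 2) (4, 3)
```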

Upvotes: 2

Alex Ferguson

Reputation: 107

As mentioned above, you can use the random_state parameter. But if you want to generate the same results globally, i.e. set the random state for all future calls, you can seed NumPy's global random number generator:

np.random.seed(42)  # any fixed integer works
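A minimal sketch of that approach, re-seeding immediately before each call (train_test_split draws from NumPy's global RNG when random_state is omitted):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = list(range(5))

# Seeding the global RNG right before each call makes the split
# reproducible even without passing random_state.
np.random.seed(0)
split1 = train_test_split(X, y, test_size=0.4)
np.random.seed(0)
split2 = train_test_split(X, y, test_size=0.4)

print(np.array_equal(split1[0], split2[0]))  # True
```

Note that re-seeding the global RNG affects all NumPy-based randomness in the process, so passing random_state per call is usually the safer choice.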

Upvotes: 0

eqzx

Reputation: 5559

Yes, using the same random_state is enough.

>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X,X))
>>> X_train, X_test, _, _ = train_test_split(X,y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2,y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])

Upvotes: 18

piman314

Reputation: 5355

Looking at the code for train_test_split, it seeds the random state inside the function on every call, so a fixed random_state produces the same split every time. We can check this pretty simply:

import numpy as np
from sklearn import model_selection

X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)

X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y,
                                                                        test_size=0.1,
                                                                        random_state=42)

print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))

Which outputs:

True
True

Which is good! Another way of approaching this problem is to create one train/test split over all your features combined, and then separate the features before training. However, if you need to split both datasets at once (for example, with similarity matrices you don't want test features in your training set), you can use the StratifiedShuffleSplit class to return the indices of the data belonging to each set. For example:

n_splits = 1 
# Note: stratification requires y to hold class labels with at least
# two examples per class (an array of all-unique values will not work).
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits, 
                                             test_size=0.1,
                                             random_state=42)
train_idx, test_idx = list(sss.split(X, y))[0]
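The returned index arrays can then slice any number of aligned datasets; in this sketch, A and B are hypothetical feature matrices for the same 100 examples:

```python
import numpy as np
from sklearn import model_selection

# Hypothetical data: two feature sets for the same 100 examples,
# with two balanced classes so stratification is possible.
A = np.random.random((100, 5))
B = np.random.random((100, 8))
y = np.array([0, 1] * 50)

sss = model_selection.StratifiedShuffleSplit(n_splits=1,
                                             test_size=0.1,
                                             random_state=42)
train_idx, test_idx = next(sss.split(A, y))

# The same index arrays slice both datasets, keeping rows aligned.
A_train, A_test = A[train_idx], A[test_idx]
B_train, B_test = B[train_idx], B[test_idx]

print(A_train.shape, B_test.shape)  # (90, 5) (10, 8)
```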

Upvotes: 8
