Funky

Reputation: 103

Cannot Reproduce the Splitting of Train and Test using sklearn

I am using Jupyter Notebook version 5.6.0 through the Anaconda Navigator. I am trying to split my dataset into training and test sets, but whenever I shut down my notebook, reopen it, and rerun the code, it creates a different training and test set. The random_state works if I just rerun the code without shutting down the notebook. Is this normal? Is there a way to fix this so that even though I close and reopen the notebook, it still splits the dataset the same way?

I have set the random_state parameter of train_test_split, but it still produces a different split.

Here is my code so far:

# Split dataset into training and testing
from sklearn.model_selection import train_test_split

y = train['Target']

x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=0)
x_train.head()

The result from the first run is this:

      |   ID    |   Country       |   Target
7093  |   9.56  |   Tokyo         |   Yes
5053  |   9.58  |   Bangkok       |   Yes
1627  |   9.53  |   New York      |   No
2514  |   9.55  |   Los Angeles   |   No

Rerunning after reopening the notebook shows:

      |   ID    |   Country       |   Target
3805  |   9.51  |   Chicago       |   No
6730  |   9.59  |   Seattle       |   No
7623  |   9.57  |   Busan         |   Yes
7045  |   9.60  |   Seoul         |   Yes

Upvotes: 0

Views: 699

Answers (2)

Kea Ivo

Reputation: 31

Where and how do you get your data? If your data comes from a dynamic source (randomly generated data, data fetched from a server, or a reduced dataset built by picking random rows from a larger one), it will cause such an issue. I would usually solve the problem by creating a copy of my data, so that I can refer to it later, using pandas' pickle support. This code assumes I already have my data as a DataFrame:

df.to_pickle(file_name)

Next I would simply load the data I have stored:

df = pd.read_pickle(file_name)

Then split as usual:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
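Putting the pieces together, a minimal sketch of the workflow this answer describes could look like the following; the file name "train_snapshot.pkl" is a placeholder, and the 80/20 split with random_state=0 is taken from the question rather than this answer:

import pandas as pd
from sklearn.model_selection import train_test_split

# One-time step: freeze the data to disk so every session starts
# from exactly the same rows (placeholder file name)
# train.to_pickle("train_snapshot.pkl")

# In every later session: reload the frozen copy and split it
train = pd.read_pickle("train_snapshot.pkl")
y = train['Target']

# With identical input data and a fixed random_state, the split
# comes out the same across notebook restarts
x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=0)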

Upvotes: 1

foxpal

Reputation: 623

Try setting the random seed in NumPy as well:

import numpy as np
np.random.seed(42)
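As a minimal sketch of how this combines with the question's code (assuming the question's train DataFrame is already loaded): np.random.seed fixes NumPy's global random state, which covers any NumPy-based shuffling or sampling done elsewhere in the notebook, while train_test_split's own random_state covers the split itself.

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)  # fixes NumPy's global RNG for any other random steps in the notebook

# the question's split, unchanged
y = train['Target']
x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=0)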

Upvotes: 0
