Reputation: 103
I am using Jupyter Notebook version 5.6.0 through the Anaconda Navigator. I am trying to split my dataset into a training and a test set, but whenever I shut down and reopen my notebook and rerun the code, it creates a different training and test set. The random_state works if I just rerun the code without shutting down the notebook. Is this normal? Is there a way to fix this so that even if I close and reopen the notebook, it still splits the dataset the same way?
I have set the random_state of the train_test_split method, but it still produces a different data split.
Here is my code so far:
#Split dataset into Training and Testing
from sklearn.model_selection import train_test_split
y = train['Target']
x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state = 0)
x_train.head()
The result from the first run is this:
     | ID   | Country     | Target
7093 | 9.56 | Tokyo       | Yes
5053 | 9.58 | Bangkok     | Yes
1627 | 9.53 | New York    | No
2514 | 9.55 | Los Angeles | No
After reopening the notebook and rerunning, the values are:
     | ID   | Country | Target
3805 | 9.51 | Chicago | No
6730 | 9.59 | Seattle | No
7623 | 9.57 | Busan   | Yes
7045 | 9.60 | Seoul   | Yes
Upvotes: 0
Views: 699
Reputation: 31
Where and how do you get your data from? If your data comes from a dynamic source (randomly generated data, data from a server, or data you are reducing by picking random values from it), it will cause such an issue.
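For example (a minimal sketch; the file name and the sampling step here are assumptions, not taken from your post), any unseeded randomness upstream of the split changes the data between sessions, and random_state cannot compensate for that:

import pandas as pd
from sklearn.model_selection import train_test_split

full = pd.read_csv('full_data.csv')   # assumed source file
train = full.sample(n=5000)           # no random_state -> different rows on every restart
y = train['Target']

# The split of *this* frame is reproducible, but the frame itself differs each run
x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=0)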
I would usually solve this problem by creating a copy of my data, so that I can refer to it later, using pandas' pickle I/O. This code assumes I already have my data as a DataFrame:
df.to_pickle(file_name)
Next, I would simply use the data I have stored:
df = pd.read_pickle(file_name)
Then:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
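Putting it together, here is a minimal sketch of the whole workflow (the file name 'train.pkl' and the column names are placeholders): save the DataFrame once, then in every later session reload the exact same frame before splitting, so a fixed random_state gives an identical split across notebook restarts.

import pandas as pd
from sklearn.model_selection import train_test_split

# Run once, while the data you want to keep is in memory:
# df.to_pickle('train.pkl')

# In every later session, reload the identical frame from disk:
df = pd.read_pickle('train.pkl')
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)
print(X_train.index[:5])   # same indices every run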
Upvotes: 1