Kazi Abu Jafor Jaber

Reputation: 161

What does this Code mean? (Train Test Split Scikitlearn)

Everywhere I go I see this code. Need help understanding this.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)

What do X_train, X_test, y_train, and y_test mean in this context, and which of them should I pass to fit() and predict()?

Upvotes: 2

Views: 2893

Answers (3)

Sociopath

Reputation: 13426

In simple terms, train_test_split divides your dataset into a training dataset and a validation (test) dataset.

The validation set is held out from training and used to evaluate the fitted model.

So in this case the validation dataset gives us an idea of how the model performs on data it has not seen.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

The above line splits the data into 4 parts:

  1. X_train - features of the training dataset
  2. y_train - outputs (labels) of the training dataset
  3. X_test - features of the validation dataset
  4. y_test - outputs (labels) of the validation dataset

Note that the keyword is test_size, not testsize. test_size=0.2 means you'll have 20% validation data and 80% training data.
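To make the four parts concrete, here is a small sketch on made-up data. The toy arrays and the random_state value are my own assumptions for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state is only set so the shuffle is reproducible (an assumption, not required).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(X_test.shape, y_test.shape)    # (2, 2) (2,)
```

The rows are shuffled before splitting, but each row of X stays paired with its own label in y.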

Upvotes: 2

Aniket Dixit

Reputation: 150

Basically, this code splits your data into two parts:

  1. one is used for training
  2. the other is used for testing

With the test_size parameter you can set the size of the testing data.

After dividing the data into two parts, you fit the model on the training data with the fit() method.

Upvotes: 1

Iuli

Reputation: 68

As the documentation says, what train_test_split does is: "Split arrays or matrices into random train and test subsets." You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

I believe the right keyword argument is test_size instead of testsize. It represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int. X and y are the sequence of indexables with the same length / shape[0], so basically the arrays/lists/matrices/dataframes to be split.
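A quick sketch of the float vs. int behaviour of test_size, using made-up lists of 50 items (the data and random_state are assumptions for illustration only):

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 samples and 50 matching labels.
X = list(range(50))
y = [i % 2 for i in X]

# Float: fraction of the data placed in the test split (20% of 50 = 10).
_, X_test_frac, _, y_test_frac = train_test_split(X, y, test_size=0.2, random_state=0)

# Int: absolute number of test samples.
_, X_test_abs, _, y_test_abs = train_test_split(X, y, test_size=5, random_state=0)

print(len(X_test_frac))  # 10
print(len(X_test_abs))   # 5
```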

So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X, and y_train and y_test for y). Each test subset should contain 20% of the original array entries as test samples. You should pass the _train subsets (X_train, y_train) to fit() and X_test to predict(), then compare the predictions against y_test. Hope this helps~
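Putting it together, a minimal end-to-end sketch could look like the following. The iris dataset, LogisticRegression, max_iter, and random_state are my own choices for illustration; any estimator with fit() and predict() is used the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data (an assumption): the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn from the training subsets only
y_pred = model.predict(X_test)     # predict labels for the unseen test features

# y_test is never passed to the model; it is only used to score the predictions.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the held-out test set: {accuracy:.2f}")
```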

Upvotes: 4
