Reputation: 161
I see this code everywhere and need help understanding it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
What do X_train, X_test, y_train, and y_test mean in this context, and which of them should I pass to fit() and predict()?
Upvotes: 2
Views: 2893
Reputation: 13426
In simple terms, train_test_split divides your dataset into a training set and a validation (test) set.
The model is fitted on the training set, and the validation set is then used to evaluate it.
So the validation set gives us an idea of how the model performs on data it was not trained on.
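X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
The line above splits the data into four parts: X_train and y_train are the training inputs and targets, and X_test and y_test are the validation inputs and targets.
test_size=0.2 (note that the keyword is test_size, not testsize) means you'll have 20% validation data and 80% training data.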
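To make the four parts concrete, here is a minimal sketch on made-up data. The toy arrays and random_state=0 are assumptions added for illustration and are not part of the question's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, one target per sample.
# Row i of X is [2*i, 2*i + 1] and its target is i, so pairs are easy to check.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state is fixed only so that the shuffle is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

The rows are shuffled before splitting, but each row of X stays paired with its own entry of y in both subsets.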
Upvotes: 2
Reputation: 150
Basically, this code splits your data into two parts: a training set and a testing set.
With the test_size parameter you can set the size of the testing set.
After the split, you fit your model on the training data with the fit() method.
Upvotes: 1
Reputation: 68
As the documentation says, what train_test_split does is: "Splits arrays or matrices into random train and test subsets". You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
I believe the right keyword argument is test_size instead of testsize. It represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int.
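As a rough illustration of the float versus int behaviour of test_size, the sketch below uses a made-up 20-sample dataset. The data and random_state=0 are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Float: test_size is a fraction of the dataset (20% of 20 samples).
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Int: test_size is an absolute number of test samples.
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X, y, test_size=3, random_state=0
)

print(len(X_test_f), len(X_train_f))  # 4 16
print(len(X_test_i), len(X_train_i))  # 3 17
```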
X and y are the "sequence of indexables with same length / shape[0]", so basically the arrays/lists/matrices/dataframes to be split.
So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X and y_train and y_test for y). Each test subset should contain 20% of the original array entries as test samples.
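You should pass the _train subsets to fit(), i.e. fit(X_train, y_train), and X_test to predict(). y_test is not passed to predict(); you compare it with the predictions to see how well the model did. Hope this helps.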
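A minimal end-to-end sketch of that workflow is shown here. The question does not name a model, so LinearRegression, the toy data, and random_state=42 are all assumptions made for the example; any scikit-learn estimator with fit() and predict() is used the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the target is an exact linear function of one feature.
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)      # learn from the training subsets only
y_pred = model.predict(X_test)   # predict for inputs held out from training

# y_test is never given to predict(); it is only used to check the predictions.
print(model.score(X_test, y_test))  # R^2 on the held-out data
```

Evaluating on X_test and y_test rather than on the training subsets is what gives an estimate of performance on unseen data.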
Upvotes: 4