Reputation: 1653
I have a fixed training dataset file train.csv and a separate test dataset file test.csv. I know the train_test_split() method in sklearn can do the splitting, but I want to create the two datasets separately, with each dataset taken entirely from its own file.
I have tried this:
# X, Y and X_, Y_ below are the training and test samples/labels (DataFrames)
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0)
# test_size=1.0 is not an accepted parameter value
trainX_, testX_, trainY_, testY_ = train_test_split(X_, Y_, test_size=1.0)
# ...
dtree = tree.DecisionTreeClassifier(criterion="gini")
dtree.fit(trainX, trainY)
...
Y_pred = dtree.predict(testX_)
The idea is to use trainX, trainY for training and testX_, testY_ for prediction.
However, the train_test_split() method doesn't accept test_size=1.0, leading to a failure.
So what's the right way to create training and test datasets separately?
Upvotes: 0
Views: 2120
Reputation: 5955
The purpose of train_test_split is to create both a train and a test set from a single dataset by random sampling. If you want to use all of X_, y_ as a holdout set to test on, then you don't need to split it at all; at most you would split X, y. If you already have 2 files, then you can just use dtree.fit(X, y) and dtree.score(X_, y_), assuming you're happy that both sets are accurate and random samples of the data.
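A minimal sketch of that approach, with no call to train_test_split at all. It assumes both files share the same columns and that the label column is named label, which is a hypothetical name since the question does not show the files. The in-memory CSV text only stands in for train.csv and test.csv so the example runs on its own, and load_xy is an illustrative helper, not part of sklearn or pandas.

```python
import io

import pandas as pd
from sklearn import tree

# Stand-ins for train.csv and test.csv so the sketch is self-contained.
# With real files, pass the paths to pd.read_csv instead.
TRAIN_CSV = io.StringIO("f1,f2,label\n0,0,0\n0,1,0\n1,0,1\n1,1,1\n")
TEST_CSV = io.StringIO("f1,f2,label\n0,1,0\n1,0,1\n")


def load_xy(source, label_col="label"):
    """Read a CSV and separate the feature columns from the label column.

    label_col is an assumed column name; adjust it to match the real files.
    """
    df = pd.read_csv(source)
    return df.drop(columns=[label_col]), df[label_col]


# Each dataset comes entirely from its own file; nothing is split.
X, y = load_xy(TRAIN_CSV)
X_, y_ = load_xy(TEST_CSV)

dtree = tree.DecisionTreeClassifier(criterion="gini", random_state=0)
dtree.fit(X, y)                  # train on everything from the training file

y_pred = dtree.predict(X_)       # predictions for the whole test file
accuracy = dtree.score(X_, y_)   # mean accuracy on the test file
print(accuracy)
```

If a validation set is also wanted for tuning, train_test_split can still be applied to X, y alone, leaving X_, y_ untouched as the final holdout.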
Upvotes: 2