lincr

Reputation: 1653

How to create train dataset and test dataset separately in sklearn?

I have a fixed training dataset file, train.csv, and a separate test dataset file, test.csv. I know the train_test_split() function in sklearn can split a single dataset, but I want to create the two datasets separately, each one coming entirely from its own file.

I have tried:

# The X,Y and X_, Y_ following are training and test samples/labels (dataframes)
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0)
trainX_, testX_, trainY_, testY_ = train_test_split(X_, Y_, test_size=1.0)  # not an accepted parameter value
# ...
dtree = tree.DecisionTreeClassifier(criterion="gini")
dtree.fit(trainX, trainY)
...
Y_pred = dtree.predict(testX_)

The idea is to train on trainX, trainY and predict on testX_, testY_.
However, train_test_split() doesn't accept test_size=1.0, so the second call fails.

So what's the right way to create training and test datasets separately?

Upvotes: 0

Views: 2120

Answers (1)

G. Anderson

Reputation: 5955

The purpose of train_test_split is to create both a train set and a test set from a single dataset by random sampling. If you want to use all of X_, y_ as a holdout set to test on, you don't need to split it at all; only X, y would ever need splitting. Since you already have two files, you can just call dtree.fit(X, y) and dtree.score(X_, y_), assuming you're happy that both sets are accurate, random samples of the data.
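As a rough sketch of the two-file approach, something like the following should work. The helper name fit_and_score, the column names f1, f2 and label, and the in-memory CSV text standing in for train.csv and test.csv are all assumptions made for illustration; substitute your own file paths and label column.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(train_source, test_source, label_col="label"):
    """Fit on one CSV and evaluate on another, with no train_test_split.

    `label_col` is a hypothetical column name; use the one in your files.
    """
    train_df = pd.read_csv(train_source)
    test_df = pd.read_csv(test_source)

    # Separate features from labels in each file independently.
    X, y = train_df.drop(columns=[label_col]), train_df[label_col]
    X_, y_ = test_df.drop(columns=[label_col]), test_df[label_col]

    dtree = DecisionTreeClassifier(criterion="gini", random_state=0)
    dtree.fit(X, y)                  # train only on the training file
    y_pred = dtree.predict(X_)       # predict on the whole test file
    accuracy = dtree.score(X_, y_)   # mean accuracy on the test file
    return dtree, y_pred, accuracy


# Tiny in-memory stand-ins for train.csv and test.csv (made-up data).
train_csv = io.StringIO("f1,f2,label\n0,0,0\n0,1,0\n1,0,1\n1,1,1\n")
test_csv = io.StringIO("f1,f2,label\n0,0,0\n1,1,1\n")

dtree, y_pred, accuracy = fit_and_score(train_csv, test_csv)
print(y_pred, accuracy)
```

With real files you would pass the paths instead, e.g. fit_and_score("train.csv", "test.csv"). Both files need the same feature columns in the same order.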

Upvotes: 2
