Ab Bennett

Reputation: 1432

Why do I get different results when I split test and train data manually as opposed to using scikit-learn's train_test_split function?

If I run a simple decision tree regression model on data split via the train_test_split function, I get good R² scores and low MSE values.

import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

training_data = pandas.read_csv('data.csv', usecols=['y','x1','x2','x3'])
y = training_data.iloc[:, 0]   # first column is the target
x = training_data.iloc[:, 1:]  # remaining columns are the features
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state=0)
# fit the regressor with the training data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Yet if I split the data file manually into two files, 2/3 train and 1/3 test (there is a column called human, with values 1 to 9 indicating which human each row comes from; I use humans 1-6 for training and 7-9 for test),

I get negative R² scores and high MSE.

training_data = pandas.read_csv("train.csv", usecols=['y','x1','x2','x3'])
testing_data  = pandas.read_csv("test.csv", usecols=['y','x1','x2','x3'])

y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))] 
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))] 

y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

I was expecting more or less the same results, and all the data types seem the same for both calls.

What am I missing?

Upvotes: 1

Views: 1824

Answers (4)

Snives

Reputation: 1254

It may sound like a simple check, but...

In the first example you are reading data from 'data.csv'; in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3 mark and saved the first part as 'train.csv' and the remainder as 'test.csv', then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why train_test_split randomizes the rows. If you haven't already done it, try randomizing the rows first and then writing them to your train and test CSV files, to ensure you have a homogeneous dataset.
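
For example, a minimal sketch of that shuffle-then-write step (assuming your original file is 'data.csv' as in the question):

import pandas

# Read the full dataset and shuffle the rows; sample(frac=1) returns all
# rows in random order, and random_state makes the shuffle reproducible.
data = pandas.read_csv('data.csv', usecols=['y','x1','x2','x3'])
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Write the first 2/3 as the train file and the last 1/3 as the test file.
split_at = int(len(shuffled) * 2 / 3)
shuffled.iloc[:split_at].to_csv('train.csv', index=False)
shuffled.iloc[split_at:].to_csv('test.csv', index=False)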

The other line that might be out of place is this one from your second snippet:

X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))] 

Perhaps l_vars contains something other than what you expect. Maybe it should read as follows, to be consistent with your training columns:

X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))] 

Good luck, and let us know if this helps.

Upvotes: 0

Kazi Abu Jafor Jaber

Reputation: 161

Suppose your dataset contains this data:

1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0

Now suppose you want a 50% train split. train_test_split shuffles the rows, so the training half might look like this, which generalizes better:

1 + 1 = 2
2 - 2 = 0

So the model knows what to do when it sees this test data:

2 + 2
4 - 4  # since it learned both addition and subtraction

But when you split it manually without shuffling, the training half looks like this:

1 + 1 = 2
2 + 2 = 4  # only learned addition

It doesn't know what to do when it sees this test data:

2 - 2
4 - 4  # test data is subtraction

Hope this answers your question.

Upvotes: 1

Cristina Morariu

Reputation: 415

I'm assuming that both methods here actually do what you intend, and that the shapes of your X_train/X_test and y_train/y_test are the same coming from both methods. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).


Plot the distributions (i.e. make bar charts/density plots) of the labels (y) in the initial train/test sets versus those from the manual implementation. You can dive deeper and also plot the other columns of the data, to see if anything about the distributions differs between the resulting sets of the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, your labels (or other columns) may actually be sorted in the manual implementation, so you get very different distributions in the datasets you're comparing.
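
As a rough sketch of that check (matplotlib histograms; y_train/y_test come from the question's first snippet, while y_train_manual/y_test_manual are hypothetical names for the labels loaded from train.csv and test.csv):

import matplotlib.pyplot as plt

# Overlay the label distributions of each split; a visible mismatch
# between train and test suggests the split is not representative.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

axes[0].hist(y_train, bins=30, alpha=0.6, label='train')
axes[0].hist(y_test, bins=30, alpha=0.6, label='test')
axes[0].set_title('train_test_split')
axes[0].legend()

axes[1].hist(y_train_manual, bins=30, alpha=0.6, label='train')  # hypothetical names
axes[1].hist(y_test_manual, bins=30, alpha=0.6, label='test')
axes[1].set_title('manual split')
axes[1].legend()

plt.show()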

Also, if you want to make sure that your manual split produces a 'representative' set (one that would generalise well) based on model results instead of underlying data distributions, I would compare it against the results of a cross-validated model, not a single set of results.

Essentially, although the probability is small and train_test_split does some shuffling, you could get a train/test pair that performs well just out of luck. (To reduce the chance of that without doing cross-validation, I'd suggest making use of the stratify argument of the train_test_split function; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
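
A sketch of the stratify idea. Note that stratify expects discrete classes, so this stratifies on the 'human' column the question mentions rather than on the continuous target (it assumes that column is present in data.csv):

import pandas
from sklearn.model_selection import train_test_split

# Read the feature/target columns plus the discrete 'human' column
# (an assumption about the file layout) and stratify on the latter, so
# each human appears in the same proportion in train and test.
data = pandas.read_csv('data.csv', usecols=['y','x1','x2','x3','human'])
X = data[['x1','x2','x3']]
y = data['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=data['human'], random_state=0)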

If you decide to cross-validate (rather than relying on a single train_test_split), you get an average model score across the folds and a confidence interval around it, and you can check whether your second model's results fall within that interval. If they don't, again, it just means your split is actually 'corrupted' somehow (e.g. by having sorted values).
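
A minimal sketch of that comparison using scikit-learn's cross_val_score (the file name follows the question):

import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

data = pandas.read_csv('data.csv', usecols=['y','x1','x2','x3'])
X, y = data[['x1','x2','x3']], data['y']

# Five shuffled folds give a distribution of R² scores rather than a
# single lucky (or unlucky) split.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         cv=cv, scoring='r2')
print(scores.mean(), scores.std())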


P.S. I'd also add that decision trees are known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results due to bootstrapping/bagging, which acts similarly to cross-validation in reducing the chance of overfitting.)

1 - http://cv.znu.ac.ir/afsharchim/AI/lectures/Decision%20Trees%203.pdf
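
A sketch of that swap, keeping the rest of the question's pipeline unchanged:

from sklearn.ensemble import RandomForestRegressor

# Drop-in replacement for DecisionTreeRegressor: each of the 100 trees is
# fit on a bootstrap sample and their predictions are averaged, which
# damps the overfitting a single tree is prone to.
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)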

Upvotes: 6

Darren Christopher

Reputation: 4779

The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit under the hood, as per the documentation, which means this method randomizes your data when splitting.

When you split manually, you didn't randomize it, so if your labels are not spread evenly throughout your dataset, you'll of course have performance issues, since your model won't generalize well when the training data doesn't contain enough samples of the other labels.

If my suspicion is correct, you should get a similarly poor result by passing shuffle=False into train_test_split.
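
A quick way to test that suspicion: an unshuffled split over the file in its original order (this assumes the rows in data.csv are ordered the same way as your manual split, e.g. by the human column):

import pandas
from sklearn.model_selection import train_test_split

data = pandas.read_csv('data.csv', usecols=['y','x1','x2','x3'])

# shuffle=False takes the first 67% of rows as train and the last 33% as
# test, in file order - just like cutting the CSV in two by hand.
X_train, X_test, y_train, y_test = train_test_split(
    data[['x1','x2','x3']], data['y'], test_size=0.33, shuffle=False)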

Upvotes: 2
