Reputation: 477
I am working with a dataset that contains 12 attributes, including a timestamp and one attribute as the output. It has about 4000 rows, with no duplicate records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
1. In the first dataset I split the data randomly: 80% of the rows for training and the remaining 20% for testing.
2. In the second dataset I sorted the rows by timestamp and used the first 80% for training and the last 20% for testing.
Then I removed the timestamp attribute from both datasets and used the other 11 attributes for training and testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I ran the experiment several times), and for the second one it is 45%-50%.
I would appreciate it if someone could help me understand why the results for the two datasets are so different.
PS: I have already tried randomly selecting the training data from within the first 80% of the timestamps, and it did not improve the performance.
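In code, the two splits look roughly like this (a minimal sketch with scikit-learn and pandas; the file path and column names are placeholders, not my exact code):

```python
# Minimal sketch of the two splits (scikit-learn/pandas assumed;
# "data.csv", "timestamp" and "output" are placeholder names).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                   # ~4000 rows, 12 attributes
X = df.drop(columns=["timestamp", "output"])   # the 11 training attributes
y = df["output"]                               # assumed binary, hence the AUC

# Dataset 1: random 80/20 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Dataset 2: sort by timestamp, first 80% train, last 20% test
order = df["timestamp"].sort_values().index
cut = int(len(df) * 0.8)
X_tr2, X_te2 = X.loc[order[:cut]], X.loc[order[cut:]]
y_tr2, y_te2 = y.loc[order[:cut]], y.loc[order[cut:]]

for Xa, ya, Xb, yb in [(X_tr, y_tr, X_te, y_te), (X_tr2, y_tr2, X_te2, y_te2)]:
    model = RandomForestClassifier(random_state=0).fit(Xa, ya)
    print(roc_auc_score(yb, model.predict_proba(Xb)[:, 1]))
```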
Upvotes: -1
Views: 2460
Reputation: 43477
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
> RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I ran the experiment several times), and for the second one it is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set you evaluate on: that may lead you to a model that does just that, but generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation, and 20% for testing, split at random. Train your model on the training set so that it performs well on the validation set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set; if the gap is large, you have overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set plus the validation set, and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set; if you're not happy, you have to start from scratch.
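A minimal sketch of this protocol, assuming scikit-learn and pandas (the file path, column names, and hyperparameters below are placeholders, not your actual setup):

```python
# Minimal sketch of a random 60/20/20 train/validation/test protocol.
# scikit-learn/pandas assumed; "data.csv", the column names and the
# hyperparameters are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
X = df.drop(columns=["timestamp", "output"])  # the 11 training attributes
y = df["output"]                              # assumed binary, since you report AUC

# 60% train, then split the remaining 40% half-and-half into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune against the validation set only; do not touch the test set yet
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(train_auc, val_auc)  # a large gap here means you have overfit

# One-time final evaluation: refit on train + validation, score on the test set
final = RandomForestClassifier(n_estimators=200, random_state=0)
final.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))
print(roc_auc_score(y_test, final.predict_proba(X_test)[:, 1]))
```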
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes to your model to improve that average score; just run cross validation again after each change (make sure to reshuffle).
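A minimal sketch, again assuming scikit-learn (StratifiedKFold additionally keeps the class balance in each fold; the file path and column names are placeholders):

```python
# Minimal sketch of shuffled 10-fold cross validation.
# scikit-learn/pandas assumed; "data.csv" and the column names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("data.csv")
X = df.drop(columns=["timestamp", "output"])
y = df["output"]

# shuffle=True randomizes the fold assignment; change random_state (or set it
# to None) to get a fresh shuffle after each change to the model
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```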
Personally I prefer cross validation.
I am guessing what happens is that by sorting based on timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differs significantly somehow, and your algorithm is not given a chance to capture this difference? In general, your data should be shuffled randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.
Upvotes: 3