user3236918

Reputation: 47

How to split the data into training and test sets?

One approach to splitting the data into two disjoint sets, one for training and one for testing, is to take the first 80% as the training set and the rest as the test set. Is there another approach to splitting the data into training and test sets?

** For example, I have a dataset containing 20 attributes and 5000 objects. Suppose I take 12 attributes and 1000 objects as my training data, and 3 of those 12 attributes as the test set. Is this method correct?

Upvotes: 2

Views: 1096

Answers (2)

Sean Owen

Reputation: 66886

No, that's invalid. You would always use all features in all data sets. You split by "objects" (examples).

It's not clear why you are taking just 1000 objects and extracting a training set from that. What happened to the other 4000 you threw away?

Train on 4000 objects / 20 features. Cross-validate on 500 objects / 20 features. Evaluate performance on the remaining 500 objects / 20 features.
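The split above can be sketched in plain Python (the data here is randomly generated as a stand-in for your 5000-object, 20-attribute dataset; the 4000/500/500 proportions follow the answer):

```python
import random

random.seed(42)

# Stand-in for your dataset: 5000 objects, each keeping ALL 20 features.
data = [[random.random() for _ in range(20)] for _ in range(5000)]

random.shuffle(data)          # shuffle so the split doesn't depend on row order

train = data[:4000]           # 4000 objects for training
validation = data[4000:4500]  # 500 objects for cross-validation / tuning
test = data[4500:]            # 500 objects held out for the final evaluation

print(len(train), len(validation), len(test))  # 4000 500 500
```

Note that every subset keeps all 20 features; only the objects are partitioned.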

Upvotes: 2

manlio

Reputation: 18902

If your training produces a classifier based on 12 features, it could be (very) hard to evaluate its performance on a test set based on only a subset of those features (your classifier expects 12 inputs and you'd be giving it only 3).

Feature/attribute selection/extraction is important if your data contains many redundant or irrelevant features. So you could identify and use only the most informative features (maybe 12 features), but your training/validation/test sets should all be based on the same set of features (e.g., since you're mentioning Weka, see "Why do I get the error message 'training and test set are not compatible'?").

Remaining on a training/validation/test split (holdout method), a problem you can face is that the samples might not be representative.

For example, some classes might be represented with very few instances, or even with no instances at all.

A possible improvement is stratification: sampling for training and testing within classes. This ensures that each class is represented with approximately equal proportions in both subsets.
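A minimal sketch of stratified holdout sampling, in plain Python (the function name `stratified_split` and its parameters are illustrative, not from any particular library):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so that each class appears in roughly the
    same proportion in the training and test subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)          # group sample indices by class

    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)               # sample within each class
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# With a 90/10 class imbalance, the test set keeps roughly the same ratio,
# so the minority class is guaranteed to appear in both subsets.
labels = ["a"] * 90 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)
```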

However, by partitioning the available data into fixed training/test sets, you drastically reduce the number of samples which can be used for learning the model. An alternative is cross-validation.
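A sketch of k-fold cross-validation index generation in plain Python (the helper `k_fold_indices` is illustrative): every sample is used for training in k-1 folds and for testing in exactly one, so no data is permanently held out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test_fold = folds[i]
        train_folds = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_folds, test_fold

# Usage: train and evaluate the model once per fold, then average the scores.
for train_idx, test_idx in k_fold_indices(100, k=5):
    pass  # fit on train_idx, score on test_idx
```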

Upvotes: 1
