user3236918

Reputation: 47

How to make training and testing set from a dataset?

What's the best method:

  1. splitting my data into training and testing sets, using 70% of the data for training and 30% for testing, or
  2. using similar data for the training and the testing set.

A- Is the second method correct, and what are its disadvantages?

B- My dataset contains 3 attributes and 1000 objects; is this enough for selecting training and testing sets from it?

Upvotes: 0

Views: 2029

Answers (3)

doxav

Reputation: 988

  1. You can create Train-Test from a dataset in command line:

    java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -i dataset.arff -o train.arff
    java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -V -i dataset.arff -o test.arff

    (-V inverts the selection, so test.arff gets exactly the 30% of instances that train.arff does not contain.)

  2. and A): unless "all" possible future data combinations already exist in your dataset, using the same data for training and testing is a bad solution. It does not assess how well your model handles different new cases, and it cannot tell you whether you are overfitting (fitting your current data without learning reusable logic). Why don't you use "cross-validation"? It is very effective if you want to work from a single dataset: it automatically splits the data into several parts, and for each part it trains on the rest of the data, tests on that held-out part, and then computes the average result.
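The cross-validation procedure described above can be sketched in plain Python (this is not Weka code, just the same fold-and-average idea; the `majority_baseline` "classifier" and the toy dataset are made up for illustration):

```python
import random

random.seed(0)  # fixed seed so the shuffle is reproducible

def k_fold_cross_validation(data, k, train_and_score):
    """Shuffle the data, split it into k folds, and for each fold train on
    the other k-1 folds and score on the held-out one; return the average."""
    data = data[:]                                   # copy; leave caller's list alone
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]           # k roughly equal parts
    scores = []
    for i in range(k):
        held_out = folds[i]
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(rest, held_out))
    return sum(scores) / k

def majority_baseline(train, test):
    """Toy 'classifier': predict the majority label seen in training;
    its score is the fraction of test labels it gets right."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)

# hypothetical toy dataset: 100 (feature, label) pairs with two classes
data = [(i, i % 2) for i in range(100)]
avg_score = k_fold_cross_validation(data, 10, majority_baseline)
```

Weka's Explorer and Evaluation classes do this for you (10-fold is the usual default); the sketch only shows what "split, hold out, average" means.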

B) if you mean 3 attributes and 1000 instances, it could be OK, provided you don't have too many different types of output (classes) to predict and the instances cover the use cases well.

FYI: if you want to test your data on many different classifiers to find the best one, use the Weka Experimenter.

Upvotes: 0

chopss

Reputation: 811

The second option is wrong; the first option is the best.

Using the LingPipe classifier we can train and test on news data. But if you provide the same data used in training for testing purposes, it will of course show accurate output. What we want is to predict the output for unknown cases; that's how we test accuracy, right?

So what you have to do is:

1) Train on your data
2) Build a model
3) Apply test data to the model to get output for unknown sets/cases too.

Building a model is nothing but writing the trained object into a file. So each time you run the program you can load that model instead of training again, which saves time. I hope my answer helps you. Best regards.
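The save-and-reload step can be sketched in plain Python with `pickle` (not LingPipe; the trivial "model" here, the majority class seen in training, is a made-up stand-in, and the point is only that later runs load the file instead of retraining):

```python
import os
import pickle
import tempfile

def train(labels):
    """Toy 'training': remember the most common label."""
    return max(set(labels), key=labels.count)

model = train(["spam", "ham", "spam"])

# "Building a model" = writing the trained object to a file
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# A later run loads the saved model instead of retraining
with open(path, "rb") as f:
    loaded = pickle.load(f)

# this trivial "model" always predicts its stored majority class
prediction = loaded
```

In Weka the same idea is the "Save model" / "Load model" options; in Java you would serialize the trained classifier object the same way.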

Upvotes: 0

manlio

Reputation: 18902

The second method is wrong (at least if by 'similar' you mean 'same').

You shouldn't use the test set for training.

If you use just one dataset, you could achieve perfect accuracy by simply memorizing it (i.e. by overfitting). Generally this isn't what you want, because the algorithm should learn the general concept behind the examples. A way of testing whether this happens is to use separate datasets for training and testing.
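A minimal sketch of the separate-sets idea in plain Python, using the 70/30 split from the question (`train_test_split` here is a hypothetical helper written out for illustration, not a library function):

```python
import random

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data, then hold out the last
    test_fraction of it as the test set."""
    shuffled = data[:]                       # copy; leave caller's list alone
    random.Random(seed).shuffle(shuffled)    # fixed seed -> reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:-n_test], shuffled[-n_test:]

data = list(range(1000))                     # stand-in for 1000 instances
train_set, test_set = train_test_split(data)
```

Because the two sets are disjoint, any accuracy measured on `test_set` reflects behavior on data the model never saw during training.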

The test set gives you a forecast of your model's performance in the "real world", because it is independent (during the training/validation phase you make no choices based on the test data).

Upvotes: 2
