user2095107

Reputation: 73

Why do WEKA test sets have to have a class attribute?

I have a well-defined training set for machine learning (only string attributes).

For example:

@relation training_rel

@attribute class {politics,sports}
@attribute text string

@data
politics,'some text about politics over here'
% ... a lot of other training instances of class politics
sports,'and now some sports over here'
% ... a lot of other training instances of class sports

That is my training set (just an example, of course). Now I would like to build a classifier (NaiveBayes), and this works fine. I know that most classifiers cannot handle string attributes, so I have to filter my data first; I use a StringToWordVector filter for this.

All of the examples I have found on the web define their test instances with a class value as well (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/). But why? I don't know whether my text belongs to politics or sports; that is exactly why I use the classifier. Am I misunderstanding something?

Upvotes: 2

Views: 134

Answers (1)

greeness

Reputation: 16124

The labels in the test dataset are there for evaluating the classifier. You train your model on the training dataset and measure its performance on the test dataset by comparing its predictions with the known labels. Without the labels, you cannot evaluate the model on the test data.

In real usage, you won't know the actual labels. So it's important that your test data is representative of the real data; otherwise your evaluation result is of no value.
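For the prediction-only case described in the question, here is a rough sketch of what an unlabeled ARFF file could look like. It relies on the ARFF convention that `?` marks a missing value. The header still declares the class attribute so that it matches the training set, but each instance leaves the class as `?`. The relation name and the instance text are made up for illustration.

```
@relation unlabeled_rel

@attribute class {politics,sports}
@attribute text string

@data
?,'some new text whose class is unknown'
```

A file like this is only useful for obtaining predictions. Evaluation statistics such as accuracy cannot be computed from it, because there are no true labels to compare the predictions against.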

Upvotes: 1
