Reputation: 2811
I need to build a Weka classifier and then use it to predict future instances. A good source to get started is here. Unfortunately, I have noticed that future instances do not need to match the format of the source training data.
How are predictions made with such differences between the training data and the new instances?
Example train:
@relation train
@attribute A1 {e,f,g}
@attribute A2 numeric
@attribute A3 numeric
@attribute A4 {positive, negative}
@data
e, -100, 100, positive
f, -10, 10, positive
g, -90, 90, negative
Example test:
@relation test
@attribute B1 {b,a}
@attribute B2 numeric
@attribute B3 {good, bad}
@data
b, 100, good
a, 10, bad
b, 90, good
If you save the above training and test data sets you can use the following code to see that a model built on the training data is able to classify instances from the test data.
import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class Main {
    public static void main(String[] args) throws Exception {
        // Load train data
        String readTrain = "someWhere/train.arff";
        BufferedReader readerTrain = new BufferedReader(new FileReader(readTrain));
        Instances train = new Instances(readerTrain);
        readerTrain.close();
        train.setClassIndex(train.numAttributes() - 1);

        // Load test data
        String readTest = "someWhere/test.arff";
        BufferedReader readerTest = new BufferedReader(new FileReader(readTest));
        Instances test = new Instances(readerTest);
        readerTest.close();
        test.setClassIndex(test.numAttributes() - 1);

        // Create a Naive Bayes classifier
        Classifier cModel = new NaiveBayes();
        cModel.buildClassifier(train);

        // Predict distribution of instance
        double[] fDistribution = cModel.distributionForInstance(test.instance(2));
        System.out.println("Prediction class 1: " + fDistribution[0]);
        System.out.println("Prediction class 2: " + fDistribution[1]);
    }
}
Any explanation of how predictions are being made with differing data sources, or ideas for enforcing that new instances match the format of the classifier's original training data, would be appreciated. However, I do NOT want to rely on the Evaluation class.
Upvotes: 2
Views: 301
Reputation: 657
I know this question is old, but my answer might help someone else.
The method Instances.equalHeaders(Instances) is what you are looking for. It returns true if both Instances are compatible, or false otherwise.
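To make concrete what a header comparison of this kind looks at, here is a minimal plain-Java sketch that does not use Weka at all. The `Attr` class and the `headerMismatch` helper are hypothetical stand-ins written only for this illustration, and the sketch assumes that two headers match when they have the same number of attributes and each pair agrees on name and nominal labels. In real code you would call `train.equalHeaders(test)` on the loaded `Instances` objects and refuse to predict when it returns false.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class HeaderCheckSketch {

    // Hypothetical stand-in for an ARFF attribute: a name plus either
    // a list of nominal labels, or null for a numeric attribute.
    static final class Attr {
        final String name;
        final List<String> labels; // null means numeric

        Attr(String name, String... labels) {
            this.name = name;
            this.labels = labels.length == 0 ? null : Arrays.asList(labels);
        }

        boolean matches(Attr other) {
            return name.equals(other.name) && Objects.equals(labels, other.labels);
        }
    }

    // Returns null if the two headers match, otherwise a short reason.
    static String headerMismatch(List<Attr> a, List<Attr> b) {
        if (a.size() != b.size()) {
            return "different number of attributes: " + a.size() + " vs " + b.size();
        }
        for (int i = 0; i < a.size(); i++) {
            if (!a.get(i).matches(b.get(i))) {
                return "attributes differ at position " + (i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The train and test headers from the question.
        List<Attr> train = Arrays.asList(
            new Attr("A1", "e", "f", "g"),
            new Attr("A2"),
            new Attr("A3"),
            new Attr("A4", "positive", "negative"));
        List<Attr> test = Arrays.asList(
            new Attr("B1", "b", "a"),
            new Attr("B2"),
            new Attr("B3", "good", "bad"));

        System.out.println("train vs train: " + headerMismatch(train, train));
        System.out.println("train vs test:  " + headerMismatch(train, test));
    }
}
```

For the two files in the question the sketch already stops at the attribute count (four against three), so a guard of this kind placed before `distributionForInstance` would have rejected the test data.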
Upvotes: 1
Reputation: 573
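If you still have access to the dataset used to build the classifier, you can test each new instance against it with trainset.checkInstance(yourInstance). That call only looks at the size of the instance and at the ranges of its nominal and string values, so it catches instances with the wrong number of attributes or with unknown labels. If the training set is too large to keep around, you could keep a subsample of it, obtained via filtering, and check against that instead. This way, you can test whether any new instance meets the requirements of your classifier.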
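As a rough picture of what such a per-instance check involves, here is a small plain-Java sketch that does not call Weka. The `fits` helper and the way the header is encoded as an `int[]` are hypothetical and exist only for this illustration. It assumes that a nominal value is stored as the index of its label and that a missing value is stored as NaN, and it tests the same two things: the number of values, and whether each nominal index is in range.

```java
class CheckInstanceSketch {

    // numLabels[i] is the number of labels of nominal attribute i,
    // or 0 when attribute i is numeric (hypothetical encoding for this sketch).
    static boolean fits(int[] numLabels, double[] values) {
        if (values.length != numLabels.length) {
            return false; // wrong number of attributes
        }
        for (int i = 0; i < values.length; i++) {
            if (numLabels[i] == 0 || Double.isNaN(values[i])) {
                continue; // numeric attribute or missing value: nothing to check
            }
            double v = values[i];
            // A nominal value must be a whole number that indexes an existing label.
            if (v != Math.floor(v) || v < 0 || v > numLabels[i] - 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Header of train.arff: A1 {e,f,g}, A2 numeric, A3 numeric, A4 {positive,negative}
        int[] trainHeader = {3, 0, 0, 2};

        double[] trainRow = {2, -90, 90, 1}; // g, -90, 90, negative
        double[] testRow = {0, 90, 0};       // b, 90, good (from test.arff)
        double[] badLabel = {5, -90, 90, 1}; // index 5 is not a label of A1

        System.out.println("train row fits: " + fits(trainHeader, trainRow));
        System.out.println("test row fits:  " + fits(trainHeader, testRow));
        System.out.println("bad label fits: " + fits(trainHeader, badLabel));
    }
}
```

Note that a check of this shape never compares attribute names, so a row from an unrelated file with four values and in-range label indexes would still pass. If that matters, comparing the headers as well is the safer option.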
Upvotes: 0