Walter

Reputation: 2811

Verify compatibility of instance with Weka classifier

I need to build a Weka classifier and then use it to predict future instances. A good source to get started is here. Unfortunately, I have noticed that future instances apparently do not need to match the format of the training data: the classifier accepts them anyway.

How are predictions made with such differences between the training data and the new instances?

Example train:

@relation train

@attribute A1 {e,f,g}
@attribute A2 numeric
@attribute A3 numeric
@attribute A4 {positive, negative}

@data
e, -100, 100, positive
f, -10, 10, positive
g, -90, 90, negative

Example test:

@relation test

@attribute B1 {b,a}
@attribute B2 numeric
@attribute B3 {good, bad}

@data
b, 100, good
a, 10, bad
b, 90, good

If you save the training and test data sets above, you can use the following code to see that a model built on the training data classifies instances from the test data without complaint.

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class Main {
    public static void main(String[] args) throws Exception {

        //
        // Load train data
        // 
        String readTrain = "someWhere/train.arff";
        BufferedReader readerTrain = new BufferedReader(new FileReader(readTrain));
        Instances train = new Instances(readerTrain);
        readerTrain.close();
        train.setClassIndex(train.numAttributes() - 1);         

        //
        // Load test data
        // 
        String readTest = "someWhere/test.arff";
        BufferedReader readerTest = new BufferedReader(new FileReader(readTest));
        Instances test = new Instances(readerTest);
        readerTest.close();
        test.setClassIndex(test.numAttributes() - 1);  

        // Create a naïve bayes classifier
        Classifier cModel = new NaiveBayes();
        cModel.buildClassifier(train);

        // Predict distribution of instance
        double[] fDistribution = cModel.distributionForInstance(test.instance(2));
        System.out.println("Prediction class 1: " + fDistribution[0]);
        System.out.println("Prediction class 2: " + fDistribution[1]);
    }
}

Any explanation of how predictions are made when the data sources differ, or ideas for enforcing that new instances match the format of the classifier's original training data, would be appreciated. However, I DO NOT want to rely on the Evaluation class.

Upvotes: 2

Views: 301

Answers (2)

jair.jr

Reputation: 657

I know this question is old, but my answer might help someone else.

The method Instances.equalHeaders(Instances) is what you are looking for. It returns true if both Instances objects have the same header (the same attributes in the same order and the same class index), and false otherwise.
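A minimal sketch of that check, using the two headers from the question. It assumes Weka 3.7 or later (for the List-based Attribute constructor); the class name HeaderCheck and its helper methods are made up for illustration, and the headers are built in memory rather than read from the ARFF files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.Instances;

// Hypothetical helper class, not part of Weka.
public class HeaderCheck {

    // Header equivalent to the question's train.arff (no data rows needed).
    static Instances trainHeader() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("A1", Arrays.asList("e", "f", "g")));
        atts.add(new Attribute("A2")); // numeric
        atts.add(new Attribute("A3")); // numeric
        atts.add(new Attribute("A4", Arrays.asList("positive", "negative")));
        Instances header = new Instances("train", atts, 0);
        header.setClassIndex(header.numAttributes() - 1);
        return header;
    }

    // Header equivalent to the question's test.arff.
    static Instances testHeader() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("B1", Arrays.asList("b", "a")));
        atts.add(new Attribute("B2")); // numeric
        atts.add(new Attribute("B3", Arrays.asList("good", "bad")));
        Instances header = new Instances("test", atts, 0);
        header.setClassIndex(header.numAttributes() - 1);
        return header;
    }

    // Refuse to go on when the candidate data does not share the training header.
    static void requireSameHeader(Instances train, Instances candidate) {
        if (!train.equalHeaders(candidate)) {
            throw new IllegalArgumentException(
                "Header of '" + candidate.relationName()
                + "' does not match training header '" + train.relationName() + "'");
        }
    }

    public static void main(String[] args) {
        Instances train = trainHeader();
        // A header-only copy of the training data has the same header.
        System.out.println("train vs copy: " + train.equalHeaders(new Instances(train, 0)));
        // The question's test set has different attributes.
        System.out.println("train vs test: " + train.equalHeaders(testHeader()));
        requireSameHeader(train, new Instances(train, 0)); // passes silently
    }
}
```

In the question's code, a call such as requireSameHeader(train, test) placed before distributionForInstance would stop the mismatched test set from reaching the classifier.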

Upvotes: 1

shirowww

Reputation: 573

If the dataset used to build the classifier is still available, you could test each new instance against it: trainset.checkInstance(yourInstance). This checks that the instance is compatible with that dataset. If the training set is too large to keep around, you could keep a subsample of it, obtained via filtering. This way, you can test whether any new instance meets the requirements of your classifier.
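A rough sketch of the checkInstance approach, again assuming Weka 3.7 or later; InstanceCheck and fitsTrainingFormat are hypothetical names. As far as I can tell from the Javadoc, checkInstance only compares the number of attributes and the value ranges of nominal and string attributes, so it is a weaker test than comparing full headers, and an empty dataset carrying just the training header appears to be enough to run it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

// Hypothetical helper class, not part of Weka.
public class InstanceCheck {

    // Header equivalent to the question's train.arff, with zero data rows.
    static Instances trainHeader() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("A1", Arrays.asList("e", "f", "g")));
        atts.add(new Attribute("A2")); // numeric
        atts.add(new Attribute("A3")); // numeric
        atts.add(new Attribute("A4", Arrays.asList("positive", "negative")));
        Instances header = new Instances("train", atts, 0);
        header.setClassIndex(header.numAttributes() - 1);
        return header;
    }

    // True when the instance passes the dataset's own size and nominal-range check.
    static boolean fitsTrainingFormat(Instances header, Instance candidate) {
        return header.checkInstance(candidate);
    }

    public static void main(String[] args) {
        Instances header = trainHeader();

        // "g, -90, 90, negative": nominal values are stored as label indices.
        Instance wellFormed = new DenseInstance(1.0, new double[] {2, -90, 90, 1});
        // "b, 90, good" from the test set: only three values.
        Instance tooShort = new DenseInstance(1.0, new double[] {0, 90, 0});

        System.out.println("four values:  " + fitsTrainingFormat(header, wellFormed));
        System.out.println("three values: " + fitsTrainingFormat(header, tooShort));
    }
}
```

Because the check does not look at attribute names or types, an instance with the right number of values but a different meaning would still pass.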

Upvotes: 0
