Reputation: 41
I am a bit worried when using Weka's ReplaceMissingValues
to input the missing values only for the test arff dataset but not for the training dataset. Below is the commandline:
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"
From a previous post (Replace missing values with mean (Weka)), I came to know that Weka's ReplaceMissingValues
simply replace each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computation of this mean is perfectly fine for the training file, it is not okay for the test file.
This is because in the typical test scenario, we should not assume that we know the mean of the test attribute for the input missing values. We only have one test record with multiple attributes for classification instead of having the entire set of test records in a test file. Therefore, instead, we shall input the missing value based on the mean computed using the training data. Then above command would become incorrect as we would need to have another input (the means of the train attributes).
Has anybody thought about this before? How do you work around this by using weka?
Upvotes: 4
Views: 3515
Reputation: 4113
Easy, see Batch Filtering
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test set
The filter is initialized using the training data and then applied on both training and test data.
The problem is when you apply the ReplaceMissingValue filter outside any processing pipeline, because after writing the filtered data, you can't distinguish between "real" values and "imputed" values anymore. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
This example will initialize the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set, then apply the filter on "test_file_with_missing_values.arff" (with the means learning from the training set), then train a multilayer perceptron on the filtered training data and predict the class of the filtered test data.
Upvotes: 2