Reputation: 51
I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
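If I understand that page, the batch-mode filter invocation would look something like the following (the `-b`, `-i`/`-o`, and `-r`/`-s` flags are the batch-mode flags from that page; the exact Discretize options are my guess, so treat this as a sketch rather than a verified command):

```
java -Xmx1280M weka.filters.supervised.attribute.Discretize ^
 -b -c 15 ^
 -i .\training.arff -o .\training_f.arff ^
 -r .\NextDay.arff -s .\NextDay_f.arff
```

That is, the filter is initialized on the training file and then applied unchanged to the new file, so both outputs share one header.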
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
> .\MyPredictions.txt
this yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generating a temporary .arff file, and it appears that JDBC-generated .arff files do not handle "date" data correctly. My database emits dates in the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss", but both the Explorer and the .arff files generated from JDBC data represent them as type NOMINAL. So the list of labels for date attributes in the header is very, very long and never the same from dataset to dataset.
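I gather ARFF does have a native date type that takes a Java SimpleDateFormat pattern; if the generated header declared the column as date instead of nominal, that giant label list would disappear. A hand-edited header line might look like this (attribute name made up):

```
@attribute extract_time date "yyyy-MM-dd'T'HH:mm:ss"
```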
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.
Upvotes: 1
Views: 1729
Reputation: 988
The university that maintains Weka also created MOA (Massive Online Analysis) to analyse data streams and solve exactly this kind of problem. All of its classifiers are updateable, and you can compare classifier performance over time on your data flow. It also lets you detect model change (concept drift/shift) and limit your data window over time (a forget-old-data mechanism).
Once you're done testing and tuning with MOA, you can use MOA classifiers from Weka (there is an extension that enables this) and run your whole process in batch.
Upvotes: 0
Reputation: 685
There seems to be a bigger problem with your plan, too. If you build a model on data from day 1 and then use it on data from day n that contains new, never-before-seen class labels, it will be impossible to predict those new labels because there is no training data for them. Similarly, if the new data has new attributes, it will be impossible to use them for classification because none of your training data associates them with the class labels.
Thus, if you want to use a model trained on only a subset of the new data's attributes/classes, you might as well filter the new data to remove the new classes/attributes, since they wouldn't be used even if Weka ran without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into arff files, so as to query out only the attributes/classes that were in the training set. Look into sql and any major scripting language (e.g. perl, python) to do this without much fuss.
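For example, here is a minimal sketch in a Unix shell (the file names, the class-attribute name, and the data are all made up for illustration) that drops rows from the new data whose class label never appeared in the training header:

```shell
# Stand-ins for the real training set and the new day's extract:
cat > train_demo.arff <<'EOF'
@relation demo
@attribute f1 numeric
@attribute class {yes,no}
@data
1,yes
2,no
EOF

cat > nextday_demo.csv <<'EOF'
3,yes
4,maybe
5,no
EOF

# Pull the allowed label set for the class attribute out of the header.
labels=$(grep -i '^@attribute class' train_demo.arff | sed 's/.*{\(.*\)}.*/\1/')

# Keep only rows whose last field is a known label.
awk -F, -v labs="$labels" '
  BEGIN { n = split(labs, a, ","); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
  ok[$NF]' nextday_demo.csv > filtered_demo.csv

cat filtered_demo.csv   # 3,yes and 5,no survive; the unseen "maybe" row is dropped
```

The same idea works in any scripting language, or directly in SQL with a WHERE clause restricted to the known labels; the point is simply that the filtering is driven by the training header, not by the new data.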
Upvotes: 1
Reputation: 10099
I think you can use incremental (updateable) classifiers. But only a few classifiers support this option; ones like SMO and J48 do not, so you will have to use a different classifier. To learn more, visit:
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
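For example, weka.classifiers.bayes.NaiveBayesUpdateable implements Weka's UpdateableClassifier interface. As an untested sketch, reusing the flags from your J48 run, the initial model could be built with:

```
java -Xmx1280M weka.classifiers.bayes.NaiveBayesUpdateable ^
 -t .\training.arff -d .\my.model -c 15
```

Note that updating still requires the new data to share the training header, so the growing-label-set problem discussed above remains.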
Upvotes: 1