KidCrippler

Reputation: 1723

Weka filters cause data loss

I'm using Weka to build a random forest model. My data is stored in a MySQL database. I couldn't find a straightforward way to create a Weka dataset (an 'Instances' object) directly from the database, so I query it and build the dataset myself with this code:

    List<MetadataRecord> metadata = acquireMetadata(); // Loading from DB

    int datasetSize = metadata.size();
    int numFeatures = MetadataRecord.FEATURE_NUM;  // Currently set to 14

    ArrayList<Attribute> atts = new ArrayList<Attribute>();
    List<Instance> instances = new ArrayList<Instance>();
    for (int feature = 0; feature < numFeatures; feature++) {
        Attribute current = new Attribute("Attribute" + feature, feature);
        if (feature == 0) {
            for (int obj = 0; obj < datasetSize; obj++) {
                instances.add(new SparseInstance(numFeatures));
            }
        }

        for (int obj = 0; obj < datasetSize; obj++) {
            MetadataRecord record = metadata.get(obj);
            Instance inst = instances.get(obj);
            switch (feature) {
            case 0:
                inst.setValue(current, record.labelId);
                break;
            case 1:
                inst.setValue(current, record.isSecured ? 2 : 1);
                break;
            case 2:
                inst.setValue(current, record.pageCount);
                break;
                // Cases 3-13 omitted for brevity...
            }
        }
        atts.add(current);
    }

    Instances newDataset = new Instances("Metadata", atts, instances.size());

    for (Instance inst : instances) {
        newDataset.add(inst);
    }
    newDataset.setClassIndex(0);

All of the data is entered as 'numeric', but I need some of the features (the first and second) to be categorical (or "nominal", in Weka's terminology). I tried to convert them to nominal using a filter:

    NumericToNominal nomFilter = new NumericToNominal();
    nomFilter.setAttributeIndicesArray(new int[] { 0, 1 });
    nomFilter.setInputFormat(newDataset);
    newDataset = Filter.useFilter(newDataset, nomFilter);

The conversion itself works, but when I inspect the dataset in the debugger, some of the data appears to be lost!

Before applying filter:

@attribute Attribute0 numeric
@attribute Attribute1 numeric
@attribute Attribute2 numeric
// Remaining 11 attributes omitted
@data
{0 1005,1 1,2 19,3 1123,4 7,5 25,6 0.66,7 49,8 2892.21,9 5.32,10 22.63,11 0.4,12 48.95,13 5.29}

After applying filter:

@attribute Attribute0 {0,2,3,4,5,6,7,9,11,12,18,22,23,24,25,35,36,39,40,45,51,56,60,67,68,69,78,79,83,84,85,88,94,98,126,127,128,1001,1003,1004,1005,1007,1008,1009,1012,1013,1017,1018,1019,1022}
@attribute Attribute1 {1,2}
@attribute Attribute2 numeric
// Remaining 11 attributes omitted
@data
{0 1005,2 19,3 1123,4 7,5 25,6 0.66,7 49,8 2892.21,9 5.32,10 22.63,11 0.4,12 48.95,13 5.29}

Why did I lose the value of the second attribute?

Upvotes: 1

Answers (1)

Sentry

Reputation: 4113

The feature is not lost. It is just not written out explicitly, because your instances are in sparse format. Have a look at the ARFF documentation on sparse files:

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented.

Sparse ARFF files have the same header (i.e @relation and @attribute tags) but the data section is different. Instead of representing each value in order, like this:

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

the non-zero attributes are explicitly identified by attribute number and their value stated, like this:

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

Each instance is surrounded by curly braces, and the format for each entry is: <index> <space> <value> where index is the attribute index (starting from 0).

Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?).

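The last sentence of that excerpt is the important part. Your Attribute1 has two possible values, 1 and 2. Now that it is nominal, each value is stored internally as the index of its label, so the label 1 is stored as 0. Entries whose stored value is 0 are omitted from the sparse output.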
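
To make the omission rule concrete, here is a minimal plain-Java sketch (not Weka's own code) of how a sparse row might be written out. The class name SparseRowSketch and its methods are hypothetical, and using NaN to stand for a missing value is an assumption of this sketch.

```java
import java.util.StringJoiner;

public class SparseRowSketch {
    // Renders one row in sparse ARFF style: "{index value,index value,...}".
    // Entries whose stored value is 0 are skipped. NaN stands in for a
    // missing value (an assumption of this sketch) and is written as "?".
    public static String toSparse(double[] values) {
        StringJoiner row = new StringJoiner(",", "{", "}");
        for (int i = 0; i < values.length; i++) {
            if (Double.isNaN(values[i])) {
                row.add(i + " ?");
            } else if (values[i] != 0.0) {
                row.add(i + " " + format(values[i]));
            }
        }
        return row.toString();
    }

    // Prints whole numbers without a trailing ".0".
    public static String format(double v) {
        return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
    }

    public static void main(String[] args) {
        // Attribute 1 is stored as 0, so it is left out of the output.
        System.out.println(toSparse(new double[] {1005, 0, 19}));
        // A missing value is different: it is written explicitly.
        System.out.println(toSparse(new double[] {1005, Double.NaN, 19}));
    }
}
```

The first call prints a row without an entry for attribute 1, which is the same shape as the "after" output in the question. The second shows that a genuinely missing value would still appear, as a question mark.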

Again: this only affects how the instance is stored in memory and how it appears when you print it to a file or the screen. The actual content of your dataset did not change.
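
The reverse direction shows why nothing is lost. Below is a rough plain-Java sketch, again not Weka code: the class NominalSparseSketch, its label list and the map-based row are all hypothetical stand-ins. An entry that is absent from the sparse row is read back as stored value 0, which maps to the first label.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NominalSparseSketch {
    // Labels of the converted Attribute1, in declaration order: {1,2}.
    static final List<String> LABELS = Arrays.asList("1", "2");
    static final int ATTRIBUTE1 = 1;

    // A nominal value is stored as the position of its label in the list.
    public static int internalValue(String label) {
        return LABELS.indexOf(label);
    }

    // Reads Attribute1 from a sparse row (attribute index -> stored value).
    // An absent entry means the stored value is 0, not that it is missing.
    public static String labelOfAttribute1(Map<Integer, Double> sparseRow) {
        double stored = sparseRow.getOrDefault(ATTRIBUTE1, 0.0);
        return LABELS.get((int) stored);
    }

    public static void main(String[] args) {
        Map<Integer, Double> row = new TreeMap<>();
        row.put(2, 19.0); // only non-zero entries are kept in the sparse row

        System.out.println(internalValue("1"));     // stored value of label "1"
        System.out.println(labelOfAttribute1(row)); // label recovered from the absent entry
    }
}
```

In Weka itself you can do the same check by asking the instance for the attribute directly, for example with stringValue(1) on the filtered instance, rather than reading the printed sparse row.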

Upvotes: 1
