Reputation: 105
I want to build a test for LOF that shows how well it handles the dense-sparse problem in a data set. The ELKI data generator tutorial shows how to generate a data set with 4 clusters from an XML file like this:
<dataset random-seed="1" test-model="1">
  <cluster name="Dense" size="290">
    <normal mean="0.5" stddev="0.2"/>
    <normal mean="0.5" stddev="0.2"/>
    <clip min="0 0" max="1 1"/>
  </cluster>
  <cluster name="Sparse" size="100">
    <normal mean="0.25" stddev="0.05"/>
    <normal mean="0.75" stddev="0.05"/>
    <clip min="0 0" max="1 1"/>
  </cluster>
  <cluster name="Middle" size="100">
    <normal mean="0.75" stddev="0.05"/>
    <normal mean="0.75" stddev="0.05"/>
    <clip min="0 0" max="1 1"/>
  </cluster>
  <cluster name="Noise" size="10" density-correction="50">
    <uniform min="0" max="1"/>
    <uniform min="0" max="1"/>
  </cluster>
</dataset>
But how do I get hold of the outliers? The ELKI tool wants a minority label for the outliers in order to show a ROC AUC curve, and the file I get from the XML file is just a list of the points in the data set.
Should I make a plot, identify the outliers myself, append yes or no to every point to mark whether it is an outlier, and set the minority label to yes? Or is there an easier way?
Upvotes: 1
Views: 484
Reputation: 8715
ELKI will default to using the smallest class for evaluation. (You can configure evaluation differently!)
ELKI will issue a warning if the outliers are more than 5% of the data, since it is assumed that outliers are rare (they should be much less than 5%, actually).
So on your data set, ELKI should default to using "Noise" as the outlier class.
In your configuration, Noise should be 2% of the data set (10 of 500 points), so it should not warn. It should simply work out of the box.
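To double-check the 2% figure, here is a minimal Python sketch of the arithmetic. It is not ELKI code: the cluster sizes are copied from the XML in the question, the 5% threshold is the one mentioned above, and the variable names are only illustrative.

```python
# Cluster sizes copied from the generator XML in the question.
cluster_sizes = {"Dense": 290, "Sparse": 100, "Middle": 100, "Noise": 10}

total = sum(cluster_sizes.values())

# The smallest class, which is what is described above as the default
# choice for the outlier class.
minority_label = min(cluster_sizes, key=cluster_sizes.get)
outlier_fraction = cluster_sizes[minority_label] / total

# Threshold taken from the answer text (assumption: a plain "> 5%" check).
WARN_THRESHOLD = 0.05
would_warn = outlier_fraction > WARN_THRESHOLD

print(minority_label, f"{outlier_fraction:.1%}", would_warn)
```

With these sizes the smallest class is "Noise" at 10 of 500 points, well below the 5% threshold.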
Upvotes: 1