logfiler

Reputation: 23

Adjusted Mutual Information (scikit-learn)

I have implemented a clustering algorithm for summarizing log files, and am currently testing it against ground-truth data with the Adjusted Rand index and the Adjusted Mutual Information index.

The input to my algorithm is a list of log entries, and the output is a list of integers (the cluster label assigned to each entry). The ground truth is likewise a list of integers, where each integer represents the true cluster the entry belongs to. For most of my test cases I get the expected results, but one file gives unexpected output. Here are the two lists, the ground-truth clustering and my algorithm's clustering:

Ground truth list: http://pastebin.com/9Y5TE6b7

Own clustering: http://pastebin.com/hJz1M4sf

These two lists are fed into the scikit-learn functions to get the ARI and AMI. The ARI score looks roughly correct, but the AMI is above 1, which should not be possible if I understand the documentation and the definition of AMI correctly. This data set is highly unbalanced, but many of my other files are similarly unbalanced. I cannot figure this out. For reference, the scores I get for ARI and AMI are:

ARI: 0.99642743999922712

AMI: 1.0190170466324
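
As a cross-check that does not depend on the library, both scores can be recomputed from their textbook definitions. The following is only a minimal sketch, not scikit-learn's implementation: the function names are made up for illustration, the AMI is normalized by max(H(U), H(V)) (an assumption about which normalization is wanted), and the expected-MI loop is written for clarity rather than speed.

```python
from collections import Counter
from math import comb, exp, lgamma, log


def _log_factorial(k):
    # log(k!) via the log-gamma function, to avoid huge intermediate values
    return lgamma(k + 1)


def _tables(labels_true, labels_pred):
    """Return (pair counts, true-cluster sizes, predicted-cluster sizes, n)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(labels_true, labels_pred))
    return pairs, Counter(labels_true), Counter(labels_pred), len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    pairs, a, b, n = _tables(labels_true, labels_pred)
    if n < 2:
        return 1.0
    sum_pairs = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def _entropy(sizes, n):
    return -sum(c / n * log(c / n) for c in sizes.values())


def _expected_mutual_information(a, b, n):
    # Expectation of MI under the hypergeometric model of random labelings
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (
                    _log_factorial(ai) + _log_factorial(bj)
                    + _log_factorial(n - ai) + _log_factorial(n - bj)
                    - _log_factorial(n) - _log_factorial(nij)
                    - _log_factorial(ai - nij) - _log_factorial(bj - nij)
                    - _log_factorial(n - ai - bj + nij)
                )
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_information(labels_true, labels_pred):
    pairs, a, b, n = _tables(labels_true, labels_pred)
    mi = sum(
        nij / n * log(n * nij / (a[u] * b[v])) for (u, v), nij in pairs.items()
    )
    emi = _expected_mutual_information(a, b, n)
    denominator = max(_entropy(a, n), _entropy(b, n)) - emi
    if abs(denominator) < 1e-15:  # degenerate case: no information to adjust
        return 1.0
    return (mi - emi) / denominator


if __name__ == "__main__":
    # Toy labelings, not the pastebin data
    truth = [0, 0, 1, 1]
    mine = [0, 0, 1, 2]
    print("ARI:", adjusted_rand_index(truth, mine))
    print("AMI:", adjusted_mutual_information(truth, mine))
```

Pasting the two pastebin lists into `truth` and `mine` in place of the toy data gives values to compare against the library's output. With this normalization the result should not exceed 1 beyond floating-point rounding.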

Upvotes: 2

Views: 897

Answers (1)

joeln

Reputation: 3643

This has been fixed in the development version.

Upvotes: 1
