John

Reputation: 141

Faulty NMI implementation in R?

# calculate NMI(c, t); c: cluster assignment, t: ground truth

NMI <- function(c, t) {
    n <- length(c)          # = length(t)
    r <- length(unique(c))  # number of clusters
    g <- length(unique(t))  # number of ground-truth classes

    # contingency table: N[i, j] = points assigned to cluster i with true label j
    N <- matrix(0, nrow = r, ncol = g)
    for (i in 1:r) {
        for (j in 1:g) {
            N[i, j] <- sum(t[c == i] == j)
        }
    }

    N_t <- colSums(N)
    N_c <- rowSums(N)

    # mutual information: I = sum_ij (N_ij / n) * log(n * N_ij / (N_c_i * N_t_j))
    B <- (1 / n) * log(t(t((n * N) / N_c) / N_t))
    W <- B * N
    I <- sum(W, na.rm = TRUE)   # na.rm drops the 0 * log(0) cells

    # negated entropies of the two partitions (both values are <= 0,
    # so their product under the sqrt below is positive)
    H_c <- sum((1 / n) * (N_c * log(N_c / n)), na.rm = TRUE)
    H_t <- sum((1 / n) * (N_t * log(N_t / n)), na.rm = TRUE)

    nmi <- I / sqrt(H_c * H_t)

    return(nmi)
}

Running this on some clustering benchmarks gives me a value for the normalized mutual information. However, when I compare it with the NMI values produced by the aricode library, the results generally differ in the second decimal place.

I would be grateful if someone could pinpoint any error that has crept into this code.

Here is a synthetic test case:

library(aricode)
c <- c(1,1,2,2,2,3,3,3,3,4,4,4)
t <- c(1,2,2,2,3,4,3,3,3,4,4,2)
print(aricode::NMI(c, t))   # 0.489574
print(NMI(c, t))            # 0.5030771

Upvotes: 0

Views: 534

Answers (1)

MCondut

Reputation: 36

This might be very late for an answer, but for the sake of posterity:

The difference is in how you and the aricode package normalise the index. You divide by sqrt(), whereas aricode lets you choose the normalisation: function (c1, c2, variant = c("max", "min", "sqrt", "sum", "joint")), with "max" as the default.

So if you select variant = "sqrt", you should get the same answer.
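A self-contained base-R check on the vectors from the question (no aricode needed) illustrates this: both reported values come from the same mutual information and differ only in the denominator. This is a sketch of the standard NMI formulas, not aricode's internal code:

```r
c <- c(1,1,2,2,2,3,3,3,3,4,4,4)
t <- c(1,2,2,2,3,4,3,3,3,4,4,2)
n <- length(c)

N   <- table(c, t)                                # contingency table
p   <- N / n                                      # joint distribution
pc  <- rowSums(p)                                 # cluster marginals
pt  <- colSums(p)                                 # ground-truth marginals
MI  <- sum(p * log(p / outer(pc, pt)), na.rm = TRUE)  # mutual information
H_c <- -sum(pc * log(pc))                         # entropy of clustering
H_t <- -sum(pt * log(pt))                         # entropy of ground truth

MI / sqrt(H_c * H_t)   # 0.5030771 -- the question's NMI()
MI / max(H_c, H_t)     # 0.4895742 -- aricode's default, variant = "max"
```

The na.rm = TRUE handles the 0 * log(0) cells in the joint distribution, just as in the question's code.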

The NMI package uses the "sum" variant.

Upvotes: 2
