Reputation: 141
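# Calculate NMI(c, t). c: cluster assignment, t: ground truth.
# Assumes labels are the integers 1..r in c and 1..g in t.
NMI <- function(c, t) {
  n <- length(c)  # = length(t)
  r <- length(unique(c))
  g <- length(unique(t))
  # Contingency matrix: N[i, j] = number of points in cluster i with true class j
  N <- matrix(0, nrow = r, ncol = g)
  for (i in 1:r) {
    for (j in 1:g) {
      N[i, j] <- sum(t[c == i] == j)
    }
  }
  N_t <- colSums(N)
  N_c <- rowSums(N)
  # B[i, j] = (1/n) * log(n * N[i, j] / (N_c[i] * N_t[j]))
  B <- (1 / n) * log(t(t((n * N) / N_c) / N_t))
  W <- B * N
  # Empty cells give 0 * -Inf = NaN, which na.rm drops
  I <- sum(W, na.rm = TRUE)
  # H_c and H_t are negated entropies; their product is positive
  H_c <- sum((1 / n) * (N_c * log(N_c / n)), na.rm = TRUE)
  H_t <- sum((1 / n) * (N_t * log(N_t / n)), na.rm = TRUE)
  nmi <- I / sqrt(H_c * H_t)
  return(nmi)
}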
Running this on some clustering benchmarks gives me a value for the Normalized Mutual Information. But when I compare it with the NMI values obtained from the aricode library, the two generally differ in the second decimal place.
I would be grateful if someone could pinpoint any error that has crept into this code.
I am including a test on a synthetic case:
library(aricode)
c <- c(1,1,2,2,2,3,3,3,3,4,4,4)
t <- c(1,2,2,2,3,4,3,3,3,4,4,2)
print(aricode::NMI(c , t)) #0.489574
print(NMI(c,t)) #0.5030771
Upvotes: 0
Views: 534
Reputation: 36
This might be very late for an answer but for the sake of posterity:
The difference is in the way you and the aricode package normalise the index. You divide by sqrt(H_c * H_t), whereas aricode offers the following options:
function (c1, c2, variant = c("max", "min", "sqrt", "sum", "joint"))
The default is the first option, "max", so if you pass variant = "sqrt" you should get the same answer.
The NMI package uses sum.
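To check this without relying on either package, here is a minimal base-R sketch that recomputes the mutual information for the test case in the question and applies two normalisations. It is only an illustration, not aricode's actual source: the names c_lab, t_lab and entropy are my own, and the "max" line simply assumes that variant means dividing by the larger of the two entropies.

```r
# Sketch only: a base-R recomputation, not aricode's implementation.
c_lab <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4)  # cluster assignment from the question
t_lab <- c(1, 2, 2, 2, 3, 4, 3, 3, 3, 4, 4, 2)  # ground truth from the question

# Shannon entropy (natural log) of a probability vector; zero cells are skipped
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

P   <- table(c_lab, t_lab) / length(c_lab)  # joint distribution of (cluster, class)
H_c <- entropy(rowSums(P))                  # entropy of the cluster assignment
H_t <- entropy(colSums(P))                  # entropy of the ground truth
MI  <- H_c + H_t - entropy(as.vector(P))    # I(C;T) = H(C) + H(T) - H(C,T)

nmi_sqrt <- MI / sqrt(H_c * H_t)  # normalisation used in the question's code
nmi_max  <- MI / max(H_c, H_t)    # assumed meaning of variant = "max"

print(round(c(sqrt = nmi_sqrt, max = nmi_max), 6))
```

On this example the sqrt value should agree with the 0.5030771 from the question's function and the max value with the 0.489574 reported for aricode::NMI(c, t), which suggests the gap comes from the normalisation alone and not from a bug in the mutual information itself.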
Upvotes: 2