jack

Reputation: 273

How to calculate the entropy

I am new to R and unable to calculate the entropy. There is a similar question on Stack Overflow with an answer, but I want to know why this code isn't working. The data below is copied from that question.

One of the answers says, "The part I think you are missing is the calculation of the class frequencies and you will get your answer", but how do I fix this? I tried most of the options, but I still don't get any output. The code just runs without any errors.

info <- function(CLASS.FREQ){
  freq.class <- CLASS.FREQ
  info <- 0
  for(i in 1:length(freq.class)){
    if(freq.class[[i]] != 0){ # zero check in class
      entropy <- -sum(freq.class[[i]] * log2(freq.class[[i]]))  #I calculate the entropy for each class i here
    }else{ 
      entropy <- 0
    } 
    info <- info + entropy # sum up entropy from all classes
  }
  return(info)
}

The dataset is below:

buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")

credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")

student <- c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no")

income <- c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")

age <- c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) 

Note that we changed age from categorical to numeric.

Cheers, Jack

Upvotes: 1

Views: 1201

Answers (1)

mra68

Reputation: 2960

You need to calculate the proportion of "no" and "yes" in "buys", the proportion of "fair" and "excellent" in "credit", and so on. Here is one way to do it:

data <- list(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium"),
  age = c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) 
  )

freq <- lapply( data, function(x){rowMeans(outer(unique(x),x,"=="))})


> freq
$buys
[1] 0.3571429 0.6428571

$credit
[1] 0.5714286 0.4285714

$student
[1] 0.5 0.5

$income
[1] 0.2857143 0.4285714 0.2857143

$age
 [1] 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857
[14] 0.07142857
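The `outer(unique(x), x, "==")` call builds a logical matrix with one row per distinct value and one column per observation, so each row mean is the share of observations equal to that value. As a rough sketch of the same idea, the proportions could also be computed with `table()`. The helper name `class_props` is made up for illustration and is not part of the question or of base R:

```r
# Hypothetical helper (not from the original post): class proportions via table().
class_props <- function(x) {
  # levels = unique(x) keeps the values in order of first appearance,
  # which matches the row order of outer(unique(x), x, "==")
  counts <- table(factor(x, levels = unique(x)))
  as.vector(counts) / length(x)
}

buys <- c("no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no")

# Proportions of "no" and "yes", in that order
class_props(buys)

# Should agree with the outer()-based calculation
all.equal(class_props(buys), rowMeans(outer(unique(buys), buys, "==")))
```

Either form only ever reports values that actually occur in the data, so no proportion comes out as 0.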

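Proportions computed this way can never be 0, because every value returned by unique(x) occurs in x at least once. So replace the zero check if(freq.class[[i]] != 0) with a check for an empty entry, if(length(freq.class[[i]]) != 0). The length check also yields a single TRUE or FALSE, whereas the original comparison returns one value per proportion when a list element is a vector: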

info <- function(CLASS.FREQ){
  freq.class <- CLASS.FREQ
  info <- 0
  for(i in 1:length(freq.class)){
    if(length(freq.class[[i]]) != 0){ # zero check in class
      entropy <- -sum(freq.class[[i]] * log2(freq.class[[i]]))  #I calculate the entropy for each class i here
    }else{ 
      entropy <- 0
    } 
    info <- info + entropy # sum up entropy from all classes
  }
  return(info)
}


> info(freq)
[1] 8.289526
> info(freq$buys)
[1] 0.940286
> info(freq$age)
[1] 3.807355
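Note that info(freq) adds up the entropies of all five attributes, which is why it is so much larger than info(freq$buys). If a separate value per attribute is wanted, a vectorised variant might look like the sketch below. The name `entropy_bits` is hypothetical, and the proportions are typed in by hand from the freq output above (5/14 and 9/14 for buys, 7/14 each for student) so that the block runs on its own:

```r
# Hypothetical helper (not from the original post):
# Shannon entropy in bits for one vector of class proportions.
entropy_bits <- function(p) {
  p <- p[p > 0]        # drop zeros so that 0 * log2(0) contributes nothing
  -sum(p * log2(p))
}

# Proportions copied from the freq output above (assumed, not recomputed here)
freq <- list(
  buys    = c(5, 9) / 14,
  student = c(7, 7) / 14
)

# One entropy value per attribute, returned as a named numeric vector
per_attribute <- vapply(freq, entropy_bits, numeric(1))
per_attribute

# Summing the per-attribute values mirrors what the loop in info() does
sum(per_attribute)
```

For buys this gives roughly 0.940286, the same value as info(freq$buys) above.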

Upvotes: 1
