Sailesh

Reputation: 115

Binning two vectors of different ranges using R

I'm trying to assess the performance of a simple prediction model in R by binning the predictions into defined intervals and comparing them with the corresponding (binned) actual values.

I have two vectors, actual and predicted, as shown:

> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

I need to perform binning here. First, the values of 'actual' are discretized into levels, say:

0-5: Level 1
6-10: Level 2
...
41-45: Level 9

Next, I have to bin the values of 'predicted' into the same buckets. I tried to achieve this using the cut() function in R:

binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)

However, the second element of predicted (98.01) is labelled 5, even though it does not fall in the desired interval. I don't think using a different binCount for predicted would help. Can anyone suggest a solution?

Upvotes: 0

Views: 554

Answers (2)

probaPerception

Reputation: 591

I'm not 100% sure what you want to do.

However, from what I understand, you want to return, for each element of each vector, the class it falls into, given a single set of classes that covers every value in both actual and predicted.

If that is what you want, note that your script first creates classes from the range of actual alone (roughly 0 to 41) and uses them to classify your first vector.

Then it creates a new set of classes from the range of predicted, so the two classifications are no longer the same.
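
To see the mismatch, here is a minimal sketch using the vectors from the question. It compares the intervals that cut() derives when it is given only a bin count; the names levelsActual and levelsPred are just illustrative.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
binCount <- 5

# With a single number as 'breaks', cut() splits the range of its own input
# into that many equal-width intervals, so each vector gets its own intervals.
levelsActual <- levels(cut(actual, breaks = binCount))
levelsPred <- levels(cut(predicted, breaks = binCount))

print(levelsActual)  # intervals spanning roughly 0 to 41
print(levelsPred)    # intervals spanning roughly 3 to 98
identical(levelsActual, levelsPred)  # FALSE
```

Label 5 therefore means "top fifth of that vector's own range", which is why 98.01 receives it.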

Assuming I understood correctly, I'd rather write:

actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

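# Build one shared set of breaks from the combined range of both vectors
combined <- c(actual, predicted)
binCount <- 5
# Note: binCount break points give binCount - 1 classes
s <- seq(min(combined), max(combined), length.out = binCount)

# Apply the same breaks to both vectors so the labels are comparable
binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))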

It gives:

> binActual
 [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4

> binPred
 [1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
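
Note that a sequence of binCount break points delimits only binCount - 1 intervals, which is why four levels appear above. If exactly binCount classes are wanted, a possible variant (a sketch, with s5, binActual5 and binPred5 as illustrative names) uses binCount + 1 break points:

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
binCount <- 5

combined <- c(actual, predicted)
# binCount + 1 break points delimit exactly binCount intervals
s5 <- seq(min(combined), max(combined), length.out = binCount + 1)

binActual5 <- cut(actual, breaks = s5, include.lowest = TRUE, labels = 1:binCount)
binPred5 <- cut(predicted, breaks = s5, include.lowest = TRUE, labels = 1:binCount)

nlevels(binActual5)  # 5
```

The bins are still equal-width slices of the combined range, so they do not correspond to the 0-5, 6-10, ... levels from the question.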

I'm not sure this is what you're looking for, so let me know and I may be able to help further. Best wishes.

Upvotes: 2

Lars Lau Raket

Reputation: 1974

Is this what you want?

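# Column 1 holds the lower bounds (0, 5, ..., 40), column 2 the upper bounds (5, 10, ..., 45)
intervals <- cbind(seq(0, 40, length.out = 9), seq(5, 45, length.out = 9))

# Return the index of the interval [lower, upper) containing each value,
# or NA if the value lies outside all intervals
cutFixed <- function(x, intervals) {
    sapply(x, function(xi) {
        if (xi < min(intervals) || xi >= max(intervals)) return(NA)
        which(xi >= intervals[, 1] & xi < intervals[, 2])
    })
}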

This gives the following result:

> cutFixed(actual, intervals)
 [1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
 [1]  1 NA  1  1  7  1  1  1  1  1  1  3  1  2  1  1  1  2  1
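
Note that cutFixed() treats each interval as [lower, upper), so an actual value of 5 lands in level 2, whereas the question lists 0-5 as level 1. As an alternative sketch (not the method above), base R's cut() with explicit breaks uses right-closed intervals by default and returns NA for values outside the breaks. The names fixedBreaks, binActualFixed and binPredFixed are illustrative.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Fixed break points 0, 5, ..., 45 define nine intervals:
# [0,5], (5,10], ..., (40,45]  (include.lowest = TRUE closes the first one at 0)
fixedBreaks <- seq(0, 45, by = 5)

binActualFixed <- cut(actual, breaks = fixedBreaks, labels = 1:9, include.lowest = TRUE)
binPredFixed <- cut(predicted, breaks = fixedBreaks, labels = 1:9, include.lowest = TRUE)

as.integer(binActualFixed)  # 5 falls in level 1, 6 in level 2, 41 in level 9
as.integer(binPredFixed)    # 98.01 lies outside the breaks and becomes NA
```

Because predicted as posted has 19 values and actual has 20, the two binned vectors cannot be compared element by element until their lengths are reconciled.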

Upvotes: 0
