EmptyHead

Reputation: 23

A binning procedure in R?

I am struggling to put the following binning "algorithm/procedure" into an R code/script; it may be similar to those used for binned kernel density estimation:

Say we have some data:

set.seed(12345) # setting seed
x<-rnorm(100)   # generating data

and a grid for estimation (e.g. Kernel Density Estimation):

y<-seq(from=min(x)-1, to=max(x)+1, by=0.01) # grid for binning

  1. The objective is to bin y into some number of equal intervals/bins so that each bin contains at least one observation from x (empty bins are not allowed). For this particular example I know that such a number of bins is 17, but I would like R to automatically determine this "optimal/maximum" number of bins and bin y accordingly.

  2. Say the desired number of equal intervals/bins has been determined; then one can use (at least from my active googling) the following to bin y:

nbins<-cut(y, 17) # binning

which does the job very well, as it splits y exactly the way I want. But how do I determine the center of each bin (perhaps using median()?) as well as the number of x values that fall into each bin?
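For illustration, here is roughly what I have in mind, assuming equal-width breaks over the range of y (the variable names are just placeholders):

nb   <- 17                                                  # desired number of bins
brks <- seq(min(y), max(y), length.out = nb + 1)            # explicit break points
mids <- head(brks, -1) + diff(brks) / 2                     # center of each bin
cnts <- table(cut(x, breaks = brks, include.lowest = TRUE)) # number of x per bin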

There is an interesting package, binr, with very good functionality; however, it does not seem to offer exactly what I am looking for. I would be really grateful for any hints, tips, suggestions...

EDIT: an example of the code I ended up with for my calculations.

First, I would like to say special thanks to @missuse for the help, effort and input. Second, I would like to apologize for my ignorance of some base R functions (hopefully due to a lack of experience with R, and programming in general).

I was transforming and experimenting with the code @missuse developed for my calculations, but the problem of missing x values kept coming up and often required manual adjustments for different data sets, especially when I experimented with break points determined by sample quantiles of my data. The cut function also appeared to be quite sensitive in my view (note: this is probably quite subjective, given my goals, data, etc.). So, the other day, tired of constant adjustments and of going through help() for various R functions, I found that hist() came to my rescue and resolved almost all my binning problems. Below is a very straightforward illustration of how to determine how many x fall into each bin and how to find the center of each bin:

hist(x, breaks=c(-5:5), plot=FALSE)$counts # for bin counts
hist(x, breaks=c(-5:5), plot=FALSE)$mids   # for bin centers

Above I hypothetically select the desired breaks; you can build a function based on cut the way you want and cut your grid for estimation accordingly. @missuse below provides a good foundation for setting breaks with cut; just make sure your data spans your breaks specification in hist().
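As a rough sketch of how the "maximum number of non-empty bins" question could be automated on top of hist() (the function name and the search over the range of y are my own choices, not a tested general solution):

# sketch: largest number of equal-width bins over range(y) with no bin empty of x
max_nonempty_bins <- function(x, y, kmax = length(x)) {
  ok <- sapply(2:kmax, function(k) {
    brks <- seq(min(y), max(y), length.out = k + 1)       # k equal-width bins
    all(hist(x, breaks = brks, plot = FALSE)$counts > 0)  # every bin contains some x?
  })
  max(which(ok)) + 1                                      # index 1 corresponds to k = 2
}

max_nonempty_bins(x, y)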

Upvotes: 1

Views: 1489

Answers (1)

missuse

Reputation: 19756

perhaps something like this:

data:

set.seed(12345) # setting seed
x <- rnorm(100)
y <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)
nbins <- cut(y, 17)

step 1:

for all possible numbers of bins, check whether every bin contains at least one element of x:

p <- lapply(3:length(x), function(i) {
  nbins <- cut(y, i)                                     # cut the grid into i bins
  z <- lapply(levels(nbins), function(j) y[nbins == j])  # grid values of y in each bin
  sumi <- lapply(z, function(b) {
    mini <- min(b)
    maxi <- max(b)
    sum(mini <= x & x <= maxi)                           # number of x in this bin
  })
  return(sum(unlist(sumi) > 0) == length(sumi))          # TRUE if no bin is empty
})

which(unlist(p)) shows that only the first 4 entries satisfy the rule, so 3, 4, 5 and 6 bins.
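To pull out the maximum feasible number of bins automatically, something along these lines should work (the offset of 2 is there because p starts at 3 bins):

feasible <- which(unlist(p)) + 2  # indices of p correspond to 3, 4, ... bins
feasible                          # 3 4 5 6
max(feasible)                     # the maximum number of bins with no empty bin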

step 2:

put values in a list according to bin:

z <- lapply(levels(nbins), function(j) y[nbins == j]) # grid values of y in each bin

perform the function of interest on each list item:

lapply(z, median) #median for each bin
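If the bin center should instead be the midpoint of the grid values in each bin (as hist()$mids would give), this also works:

sapply(z, function(b) mean(range(b))) # midpoint of the y values in each bin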

lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  sum(mini <= x & x <= maxi)
}) # number of elements of x in each bin

Based on the result, some bins contain 0 elements of x, so 17 bins does not satisfy the requirement from step 1.
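For example, to see exactly which of the 17 bins contain no x at all:

counts <- sapply(z, function(b) sum(min(b) <= x & x <= max(b)))
levels(nbins)[counts == 0]  # labels of the empty bins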

EDIT: on the problem with missing x:

the total number of x assigned to bins,

sum(unlist(lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  sum(mini <= x & x <= maxi)
})))

is less than 100 in many cases.

which x are missing:

nbins <- cut(y, 8)
z <- lapply(levels(nbins), function(j) y[nbins == j])
gix <- lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  x[mini <= x & x <= maxi]
})
x[!x %in% unlist(gix)]

#-1.6620502 -0.8115405

so they should be in bins (-1.67,-0.812] and (-0.812,0.0446] respectively, and they are in fact close to the bin cutoff.

This is happening because y is rounded to two decimals. For instance, if we bin the sequence 0.01, 0.02, 0.03, 0.04 and cut it into 2 bins that split the data at 0.025, we get bin 1: 0.01 - 0.02 and bin 2: 0.03 - 0.04. If we then try to assign some random x value from the range 0.01 - 0.04 based only on the y values present in the bins, nothing in the 0.02 - 0.03 range would be assigned - hence the missing values.
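A tiny reproduction of that toy example makes the gap visible:

yy <- c(0.01, 0.02, 0.03, 0.04)
bb <- cut(yy, 2)                                    # 2 bins, split at 0.025
zz <- lapply(levels(bb), function(j) yy[bb == j])
sapply(zz, range)                                   # bin 1: 0.01-0.02, bin 2: 0.03-0.04
x0 <- 0.025                                         # a value in the 0.02-0.03 gap
sapply(zz, function(b) min(b) <= x0 & x0 <= max(b)) # FALSE FALSE - x0 goes "missing"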

A possible solution is to round x to 2 decimals, since you already built a seq rounded to 2. Or build the seq of y values rounded to 4 - 6 decimals and round x accordingly. Or, instead of assigning x to bin i based on min(yi) and max(yi), replace min(yi) <= x with max(yi-1) < x (the max(y) from bin i-1) and replace x <= max(yi) with x < min(yi+1); a sketch of this option is at the end of this answer. Here is the simplest solution, rounding x to 2 decimals:

p <- lapply(3:length(x), function(i) {   # same candidate bin counts as in step 1
  nbins <- cut(y, i)
  z <- lapply(levels(nbins), function(j) y[nbins == j])
  sumi <- lapply(z, function(b) {
    mini <- min(b)
    maxi <- max(b)
    xr <- round(x, 2)                    # round x to match the 2-decimal grid
    sum(mini <= xr & xr <= maxi)
  })
  return(sum(unlist(sumi) > 0) == length(sumi))
})

That will at least solve the problem of missing x elements.

The solution to the optimization problem is the same, though:

which(unlist(p)) again shows that only the first 4 entries satisfy the rule, so 3, 4, 5 and 6 bins.
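For completeness, a rough sketch of the third option mentioned above - counting x against the gaps between consecutive bins rather than against the min/max of the y values inside each bin (the -Inf/Inf bounds for the outer bins are my own addition):

nbins <- cut(y, 8)
z  <- lapply(levels(nbins), function(j) y[nbins == j])
lo <- c(-Inf, head(sapply(z, max), -1))            # max(y) of the previous bin
hi <- c(tail(sapply(z, min), -1), Inf)             # min(y) of the next bin
mapply(function(l, h) sum(l < x & x < h), lo, hi)  # x falling in a gap counts for both neighbours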

Upvotes: 1
