Reputation: 23
I am struggling to put the following binning algorithm/procedure into an R script; it may be similar to those used for binned kernel density estimation:
Say we have some data:
set.seed(12345) # setting seed
x<-rnorm(100) # generating data
and a grid for estimation (e.g. Kernel Density Estimation):
y<-seq(from=min(x)-1, to=max(x)+1, by=0.01) # grid for binning
The objective is to bin y into some number of equal intervals/bins so that each bin contains at least one observation from x (empty bins are not allowed). For this particular example I am aware that such a number of bins is equal to 17, but I would like R to automatically determine this "optimal/maximum" number of bins and bin y accordingly.
Say the desired number of equal intervals/bins is determined; then one can use (at least from my active googling) the following to bin y:
nbins<-cut(y, 17) # binning
which does the job very well, as it splits y exactly the way I want, but how does one determine the center of each bin (perhaps using median()?) as well as the number of x which fall into each bin?
There is an interesting package binr with very good functionality; however, it does not seem to offer exactly what I am looking for. I would be really grateful for any hints, tips, suggestions...
EDIT: an example of the code I ended up with for my calculations.
First, I would like to say special thanks to @missuse for the help, effort and input. Second, I would like to apologize for my ignorance (hopefully due to lack of experience with R, and programming in general) of some base R functions.
I was transforming and experimenting with the code @missuse developed, however, the problem of missing x values constantly came up and often required manual adjustments for different data sets, especially when I was experimenting with break points determined by sample quantiles of my data. The cut function also appeared to be quite sensitive in my view (note: this is probably quite subjective, given my goals, data etc.). So, the other day, tired of constant adjustments and of going through the help() pages for various R functions, hist() came to my rescue and resolved almost all my binning problems. Below is a very straightforward illustration of how to determine how many x fall into each bin and how to find the center of each bin:
hist(x, breaks=c(-5:5), plot=FALSE)$counts # for bin counts
hist(x, breaks=c(-5:5), plot=FALSE)$mids # for bin centers
Above I hypothetically select the desired breaks; you can build a function based on cut the way you desire and cut your grid for estimation accordingly. @missuse below provides a good foundation for setting breaks with cut; just make sure your data spans across your breaks specification in hist().
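Put together, a minimal sketch of the hist() approach with the data generated above (same seed, and the hypothetical -5..5 breaks from the illustration):

```r
set.seed(12345)
x <- rnorm(100)

# plot=FALSE returns the binning without drawing anything
h <- hist(x, breaks = -5:5, plot = FALSE)  # bins (-5,-4], (-4,-3], ..., (4,5]
h$counts  # number of x falling into each bin
h$mids    # centre of each bin: -4.5, -3.5, ..., 4.5
```

As long as every observation lies inside the range spanned by breaks, sum(h$counts) equals length(x), i.e. no observation is lost.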
Upvotes: 1
Views: 1489
Reputation: 19756
perhaps something like this:
data:
set.seed(12345) # setting seed
x<-rnorm(100)
y<-seq(from=min(x)-1, to=max(x)+1, by=0.01)
nbins<-cut(y, 17)
step 1:
for all possible numbers of bins, check whether every bin contains at least one element of x:
p = lapply(3 : length(x), function(i){
  nbins <- cut(y, i)                                    # cut the grid into i bins
  z <- lapply(levels(nbins), function(j) y[nbins == j])
  sumi <- lapply(z, function(b) {
    mini <- min(b)
    maxi <- max(b)
    sum(mini <= x & x <= maxi)                          # x within this bin's y-range
  })
  return(sum(unlist(sumi) > 0) == length(sumi))         # TRUE if no bin is empty
})
which(unlist(p)) shows that only the first 4 entries satisfy the rule, i.e. 3, 4, 5 and 6 bins (the sequence starts at 3 bins)
step 2:
put values in a list according to bin:
z = lapply(levels(nbins), function(b) y[nbins == b]) # nbins is the 17-bin cut from above
perform the function of interest per list item:
lapply(z, median) #median for each bin
lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  sum(mini <= x & x <= maxi)
}) # number of elements of x in each bin
Based on the result some bins have 0 elements from x, so 17 bins does not satisfy the rule from step 1.
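As a side note (not part of the original answer): when the break points are spelled out explicitly, the per-bin counts can be obtained in one step by cutting x itself with the same breaks — a sketch, assuming an equally spaced 17-bin grid over the range of y:

```r
set.seed(12345)
x <- rnorm(100)
y <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)

brks <- seq(min(y), max(y), length.out = 18)            # 17 equal bins over the grid
counts <- table(cut(x, breaks = brks, include.lowest = TRUE))
counts  # number of x per bin; zeros, if present, reveal empty bins
```

Because the grid extends 1 beyond the data on each side, every x lies strictly inside the breaks and none can fall "between" bins.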
EDIT: on the problem with missing x:
sum(unlist(lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  sum(mini <= x & x <= maxi)
})))
is less than 100 in many cases.
which x are missing:
nbins <- cut(y, 8)
z <- lapply(levels(nbins), function(b) y[nbins == b])
gix <- lapply(z, function(b) {
  mini <- min(b)
  maxi <- max(b)
  x[mini <= x & x <= maxi]                 # the x captured by this bin's y-range
})
x[!x %in% unlist(gix)]
#-1.6620502 -0.8115405
so they should be in bins (-1.67,-0.812] and (-0.812,0.0446], and are in fact close to the bin cutoff.
This is happening because y is rounded to two decimals. For instance, if we take the sequence 0.01, 0.02, 0.03, 0.04 and cut it into 2 bins that split the data at 0.025, we get bin 1: 0.01 - 0.02 and bin 2: 0.03 - 0.04. If we then try to assign some random x value from the range 0.01 - 0.04 based only on the y values present in the bins, nothing in the 0.02 - 0.03 range would be assigned - hence the missing values.
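This toy case can be reproduced directly (a small sketch, not from the original answer):

```r
s <- c(0.01, 0.02, 0.03, 0.04)
b <- cut(s, 2)                                # splits near 0.025
z <- lapply(levels(b), function(l) s[b == l])
sapply(z, range)                              # bin 1: 0.01-0.02, bin 2: 0.03-0.04
# a value of 0.025 lies in neither [min, max] range of the two bins:
any(sapply(z, function(i) min(i) <= 0.025 & 0.025 <= max(i)))  # FALSE
```

The gap (0.02, 0.03) contains no grid values, so any x landing there is invisible to the min/max test.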
A possible solution is to round x to 2 decimals, since you already built the seq rounded to 2. Or build the seq with y values rounded to 4 - 6 decimals and round x accordingly. Or, instead of assigning x based on min(yi) and max(yi) in bin i, replace min(yi) <= x with max(yi-1) < x (max(yi) from bin i-1), and replace x <= max(yi) with x < min(yi+1).
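The last option can be sketched as follows (not from the original answer, and slightly adapted so the bin ranges partition the grid rather than overlap: each x goes to the first bin whose max(yi) reaches it), using the 8-bin cut from above:

```r
set.seed(12345)
x <- rnorm(100)
y <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)

nbins <- cut(y, 8)
z  <- lapply(levels(nbins), function(b) y[nbins == b])
hi <- sapply(z, max)                       # max(yi) of each bin

# bin i covers (max(y of bin i-1), max(y of bin i)], so consecutive
# ranges share a boundary and no x can fall between bins
counts <- sapply(seq_along(z), function(i) {
  lower <- if (i == 1) -Inf else hi[i - 1]
  sum(lower < x & x <= hi[i])
})
sum(counts)  # 100 -- no missing x
```

Since the grid extends 1 beyond the data, max(y) exceeds max(x) and every observation is counted exactly once.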
Here is the simplest solution, rounding x at 2 decimals:
p = lapply(3 : length(x), function(i){
  nbins <- cut(y, i)
  z <- lapply(levels(nbins), function(j) y[nbins == j])
  sumi <- lapply(z, function(b) {
    mini <- min(b)
    maxi <- max(b)
    xr <- round(x, 2)                       # round x to the grid's precision
    sum(mini <= xr & xr <= maxi)
  })
  return(sum(unlist(sumi) > 0) == length(sumi))
})
That will at least solve the problem of the missing x elements; the solution to the optimization problem is the same though: which(unlist(p)) shows that only the first 4 entries satisfy the rule, so 3, 4, 5 and 6 bins.
Upvotes: 1