ToNoY

Reputation: 1378

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function:

breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))

My split dataframe looks like this:

... ...

$`(15,20]`
val ks_Result   c
15         60 237
18         70 247
... ...

$`(20,25]`
val ks_Result   c
21         20 317
24         10 140
... ...

My bin counts look like this:

> table(data)
data
    (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30]   (30,35] 
        0         0         0         7       128      2748      2307 
  (35,40]   (40,45]   (45,50]   (50,55]   (55,60]   (60,65]   (65,70] 
     1404     11472      1064       536      7389      1008      1714 
  (70,75]   (75,80]   (80,85]   (85,90]   (90,95]  (95,100] (100,105] 
     2047       700       329      1107       399       376       323 
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140] 
      314        79      1008        77       474       158       381 
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175] 
       89       660        15      1090       109       824       247 
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210] 
     1226       139       531       174      1041       107       257 
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245] 
       72       671        98       212        70        95        25 
(245,250] 
      494 

When I take the mean of the bin counts, I get about 900 samples per bin:

> mean(table(data))
[1] 915.9

I want R to create irregular bins so that each bin contains about 900 samples (e.g. (0, 27] = 900, (27, 28.5] = 900, and so on). I found something similar here, but it deals with only one variable, not the whole dataframe.

I also tried the Hmisc package, but unfortunately the resulting bins don't contain equal frequencies:

library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))

Upvotes: 1

Views: 3935

Answers (2)

BrodieG

Reputation: 52637

Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:

df <- data.frame(var=runif(500, 0, 100))   # make data
cut.vec <- cut(
  df$var,
  breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
  include.lowest=TRUE               # keep the minimum value in the first bin
)
df.split <- split(df, cut.vec)      # one data frame per bin

Hmisc::cut2 has this option built in as well.
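
A quick way to confirm that quantile breaks give near-equal counts is to tabulate the resulting factor. The sketch below is only an illustration on simulated data: the seed, the column name `var`, and the names `n.bins`, `brk`, and `counts` are assumptions, not part of the original question. Note that with heavily tied values (for example, rounded measurements) some quantiles can coincide, in which case `cut` may fail on duplicated breaks or the bins may come out uneven.

```r
set.seed(42)                                  # assumption: arbitrary seed for reproducibility
df <- data.frame(var = runif(500, 0, 100))    # simulated continuous data

n.bins <- 50
brk <- quantile(df$var, probs = (0:n.bins) / n.bins)   # breaks at the 1/50 quantiles
cut.vec <- cut(df$var, breaks = brk, include.lowest = TRUE)

counts <- table(cut.vec)   # number of rows falling in each bin
range(counts)              # smallest and largest bin size
anyDuplicated(brk)         # 0 means all breaks are distinct
```

If `range(counts)` shows a wide spread, ties in the data are a likely cause, and a rank-based assignment may be a better fit than quantile breaks.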

Upvotes: 2

ToNoY

Reputation: 1378

This can be done with the function provided here by Joris Meys:

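# Assign each element of x to one of n bins of (nearly) equal size
EqualFreq2 <- function(x,n){
  nx <- length(x)
  nrepl <- floor(nx/n)                       # base number of elements per bin
  nplus <- sample(seq_len(n),nx - nrepl*n)   # bins that randomly get one extra element
  nrep <- rep(nrepl,n)
  nrep[nplus] <- nrepl+1
  x[order(x)] <- rep(seq.int(n),nrep)        # replace sorted values with bin indices
  x
}

# Split the whole data frame into 25 equal-frequency bins of val
data<-split(df2, EqualFreq2(df2$val, 25))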
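
Because the function above uses `sample`, which bins receive the extra elements changes from run to run. As a rough sketch of a deterministic alternative, bin indices can be derived from ranks, and the number of bins can be chosen from a target bin size such as the ~900 samples mentioned in the question. The helper name `equal_freq_rank`, the simulated stand-in for `df2`, and the variables `target` and `n.bins` are all assumptions made for illustration; the sketch also assumes `val` contains no missing values.

```r
# Hypothetical helper: assign each value to one of n bins by its rank
equal_freq_rank <- function(x, n) {
  r <- rank(x, ties.method = "first")    # ranks 1..length(x), ties broken by position
  as.integer(ceiling(r * n / length(x))) # map ranks onto bin indices 1..n
}

set.seed(1)                                          # assumption: arbitrary seed
df2 <- data.frame(val = round(runif(2750, 0, 250)))  # simulated stand-in with many ties

target <- 900                                  # desired samples per bin
n.bins <- max(1, round(nrow(df2) / target))    # number of bins implied by the target
bins <- equal_freq_rank(df2$val, n.bins)
data <- split(df2, bins)                       # whole data frame split by bin

sapply(data, nrow)   # bin sizes differ by at most one
```

The trade-off is that identical values can end up in two adjacent bins, since ties are broken by position rather than kept together.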

Upvotes: 0
