Erin

Reputation: 11

How to make a function to categorize variable values by percentile (quantile) in R?

"data" is a data.frame with 10 numeric variables. I want to convert every variable into a categorical variable with 6 percentile groups (below 5%, 5% to 25%, 25% to 50%, 50% to 75%, 75% to 95%, above 95%). I want to do this with a function so that I can categorize all the variables at once.

So far I can only do this without a function, as below, so I have to repeat the same code over and over.

m1<- quantile(data$val, 0.05)
m2<- quantile(data$val, 0.25)
m3<- quantile(data$val, 0.5)
m4<- quantile(data$val, 0.75)
m5<- quantile(data$val, 0.95)

data$val[data$val<m1]  = "below0.05"
data$val[data$val>= m1& data$val<m2 ]  = "0.05to0.25"
data$val[data$val>= m2& data$val<m3 ]  = "0.25to0.5"
data$val[data$val>= m3& data$val<m4 ]  = "0.5to0.75"
data$val[data$val>= m4& data$val<m5 ]  = "0.75to0.95"
data$val[data$val>= m5]  = "upper0.95"

data$val <-as.factor(data$val)

I tried some code with lapply() and function(data, name):

fun =function(data, name) {
  y <-get(name,data)
   m1<- quantile(name,data, 0.05)
   m2<- quantile(name,data, 0.25)
   m3<- quantile(name,data, 0.5)
   m4<- quantile(name,data, 0.75)
   m5<- quantile(name,data, 0.95)
   RB = rbind(m1, m2, m3, m4, m5)
   dimnames(RB)[[2]] = "Value"

   name$data[ name$data<m1]  = "below0.05"
   name$data[ name$data>= m1& name$data<m2 ]  = "0.05to0.25"
   name$data[ name$data>= m2& name$data<m3 ]  = "0.25to0.5"
   name$data[ name$data>= m3& name$data<m4 ]  = "0.5to0.75"
   name$data[ name$data>= m4& name$data<m5 ]  = "0.75to0.95"
   name$data[ name$data>= m5]  = "upper0.95"

   name$data <-as.factor(name$data)
}

It only runs partway and then fails with the error below. I want to know how to make it right. I would also like to know how to use lapply() here so that I can categorize all the variables easily. Any help is appreciated!

Error in `$<-.data.frame`(`*tmp*`, "name", value = character(0)) : 
  replacement has 0 rows, data has 301
In addition: Warning messages:
1: Unknown or uninitialised column: 'name'. 

Upvotes: 1

Views: 1325

Answers (1)

Ronak Shah

Reputation: 388797

We can use cut() to split each column at the break points returned by quantile(), and lapply() to apply that to multiple columns. Something like this should work for the first 10 columns of a data frame df (called data in the question):

lapply(df[1:10], function(x) cut(x,
    breaks = c(-Inf, quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95)), Inf),
    labels = c("below0.05", "0.05to0.25", "0.25to0.5", "0.5to0.75", "0.75to0.95", "upper0.95")))
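To write the result back into the data frame, the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: the helper name cut_by_quantile and the toy data frame are made up for illustration, every column is assumed to be numeric, and right = FALSE is chosen so that the intervals are closed on the left, matching the >= and < comparisons in the question.

```r
# Hypothetical helper (the name is illustrative): bin one numeric vector
# into the six percentile groups described in the question.
cut_by_quantile <- function(x) {
  cut(x,
      breaks = c(-Inf, quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE), Inf),
      labels = c("below0.05", "0.05to0.25", "0.25to0.5",
                 "0.5to0.75", "0.75to0.95", "upper0.95"),
      right = FALSE)  # intervals are [lower, upper), like >= and < in the question
}

# Toy data standing in for the real data frame (assumption: all columns numeric)
set.seed(1)
data <- data.frame(a = rnorm(301), b = runif(301), c = rexp(301))

# Assigning into data[] keeps the data.frame structure;
# each column is replaced by a factor with six levels
data[] <- lapply(data, cut_by_quantile)
str(data)
```

If only some columns should be converted, the same assignment can be restricted, for example data[1:10] <- lapply(data[1:10], cut_by_quantile). Note that cut() stops with an error when two quantiles coincide (for instance in a column with many tied values), because the breaks must be unique.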

Upvotes: 3
