Sylvia Rodriguez
Sylvia Rodriguez

Reputation: 1353

Make median and quartile split columns of data in data frame, based on condition (R)

If I use mtcars as an example:

 mtcars <- subset(mtcars, select = c("cyl", "disp"))

How can I add two additional columns, one that indicates that values are below/above the median and one that indicates which quartile the values are in? However, I would like this to be done per group of cyl.

This is the specific outcome I am hoping for:

                   cyl  disp    median_split    quartile_split
 Toyota Corolla    4    71.1    below_median    1st_quartile
 Honda Civic       4    75.7    below_median    1st_quartile
 Fiat 128          4    78.7    below_median    1st_quartile
 Fiat X1-9         4    79      below_median    2nd_quartile
 Lotus Europa      4    95.1    below_median    2nd_quartile
 Datsun 710        4    108     median          median
 Toyota Corona     4    120.1   above_median    3rd_quartile
 Porsche 914-2     4    120.3   above_median    3rd_quartile
 Volvo 142E        4    121     above_median    4th_quartile
 Merc 230          4    140.8   above_median    4th_quartile
 Merc 240D         4    146.7   above_median    4th_quartile
 Ferrari Dino      6    145     below_median    1st_quartile
 Mazda RX4         6    160     etc…            etc…

I would appreciate help. Thank you.

Edit following on from akun's answer below

In the quartile_split column, akun's answer left the lowest value in each cyl group with NA. I thought I could fix this by adding:

 mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

So the full code would be:

 library(dplyr)
 mtcars <- subset(mtcars, select = c("cyl", "disp"))
 # akrun's answer
 mtcars <- mtcars %>%
     group_by(cyl) %>% 
     mutate(median_split = c("above_median", "below_median")[1 + 
                   (disp <= median(disp))], 
            quartile_split = cut(disp, breaks = quantile(disp), 
                 labels = paste0(1:4, "_quartile")))
 # addition
 mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

However, when I looked more carefully, I also found something else that did not seem quite right, specifically, when you look at the cyl = 6 group only, you see this:

 cyl  disp      median_split    quartile_split
 6    145       below_median    1_quartile
 6    160       below_median    1_quartile
 6    160       below_median    1_quartile
 6    167.6     below_median    2_quartile
 6    167.6     below_median    2_quartile
 6    225       above_median    4_quartile
 6    258       above_median    4_quartile

The median disp in this group is 163.8, so the two cars with disp = 167.6 should be classified as "above_median", instead of "below_median".

I hope this could be solved, somehow. Thank you again.

Upvotes: 0

Views: 1288

Answers (2)

RaphaelS
RaphaelS

Reputation: 869

With base R and cut:

mtcars <- subset(mtcars, select = c("cyl", "disp"))
mtcars$median_split <- ifelse(mtcars$disp <= median(mtcars$disp), "below_median","above_median")
mtcars$quantile_split <- cut(mtcars$disp, breaks = c(0, quantile(mtcars$disp)),labels = c("1_quartile",paste0(1:4, "_quartile")))

Be careful when using the cut function to make sure the breaks include the minimum (or it will return NA), and that the min. is labelled in the 1st quartile.

Upvotes: 1

akrun
akrun

Reputation: 887223

An option would be to group by 'cyl', use cut to create the different categories based on the quantile on 'disp' column

library(dplyr)
mtcars %>%
    group_by(cyl) %>% 
    mutate(median_split = c("above_median", "below_median")[1 + 
                  (disp <= median(disp))], 
           quartile_split = cut(disp, breaks = quantile(disp), 
                labels = paste0(1:4, "_quartile")))

Upvotes: 2

Related Questions