Jeremy K.
Jeremy K.

Reputation: 1792

Create a `chr` column of labels after binning data with `cut()`

I've used cut() to create bins for a column of my data. Replicatable example using mtcars below:

library(tidyverse)
df <- mtcars

df$mpg_binned <- cut(x = df$mpg, breaks = 4)
df <- df %>% select(mpg, mpg_binned, everything())
head(df)

#>                    mpg  mpg_binned cyl disp  hp drat    wt  qsec vs am
#> Mazda RX4         21.0 (16.3,22.1]   6  160 110 3.90 2.620 16.46  0  1
#> Mazda RX4 Wag     21.0 (16.3,22.1]   6  160 110 3.90 2.875 17.02  0  1
#> Datsun 710        22.8   (22.1,28]   4  108  93 3.85 2.320 18.61  1  1
#> Hornet 4 Drive    21.4 (16.3,22.1]   6  258 110 3.08 3.215 19.44  1  0
#> Hornet Sportabout 18.7 (16.3,22.1]   8  360 175 3.15 3.440 17.02  0  0
#> Valiant           18.1 (16.3,22.1]   6  225 105 2.76 3.460 20.22  1  0

With the new column mpg_binned (above), I'd like to create a new chr column that serves as a label of what the bins are (for when I'm creating output tables for the end-users).

So my desired output would be something like

#>                    mpg  mpg_binned bin_label 
#> Mazda RX4         21.0 (16.3,22.1]   16.3 < mpg <= 22.1  
#> Mazda RX4 Wag     21.0 (16.3,22.1]   16.3 < mpg <= 22.1  
#> Datsun 710        22.8   (22.1,28]   22.1 < mpg <= 28  
#> Hornet 4 Drive    21.4 (16.3,22.1]   16.3 < mpg <= 22.1 


If there were only two bins, I'd use ifelse()--with multiple bins, do I need to do a nested ifelse()? Is there something simpler?

For some reason, I can't get the below line to work, for the case of one bin. And I'm looking to label all of the bins.

Tidyverse solutions would be great, but I'm open to all solutions.

df$bin_label <- 
  ifelse(df2$mpg_binned=="(16.3,22.1]", yes = "16.3 < mpg <= 22.1", no = df2$mpg_binned)

Upvotes: 0

Views: 43

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388797

Since we don't know beforehand what are the exact breaks data is going to get divided, one option would be to use regex to extract numbers. So for mpg column, we can do

sub("\\((\\d+\\.?\\d?),(\\d+\\.?\\d?).*", "\\1 < mpg <= \\2", df$mpg_binned)
#[1] "16.3 < mpg <= 22.1" "16.3 < mpg <= 22.1" "22.1 < mpg <= 28"  
#    "16.3 < mpg <= 22.1" "16.3 < mpg <= 22.1" ......

This would avoid writing multiple ifelse statements since it can grow as number of breaks increase.


For the updated case where we want to perform some mathematical operation on the extracted numbers we might need to extract the number separately.

library(dplyr)

df %>%
  mutate(first_part =  sub("\\((\\d+\\.?\\d?).*", "\\1", mpg_binned), 
         second_part = as.numeric(sub(".*,(\\d+\\.?\\d?).*", "\\1", mpg_binned)) - 1) %>%
  tidyr::unite(combined, first_part, second_part, sep = "< mpg <= ")


#   mpg cyl  disp  hp drat    wt  qsec vs am gear carb  mpg_binned          combined
#1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 (16.3,22.1] 16.3< mpg <= 21.1
#2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 (16.3,22.1] 16.3< mpg <= 21.1
#3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   (22.1,28]   22.1< mpg <= 27
#4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 (16.3,22.1] 16.3< mpg <= 21.1
#5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 (16.3,22.1] 16.3< mpg <= 21.1
#...

Upvotes: 1

akrun
akrun

Reputation: 886938

Here is one option with str_replace where we capture the numeric part as a group and replace with the backreference (\\1, \\2) of the group interspersed with the strings we want to add

library(tidyverse)
df %>% 
   rownames_to_column('rn') %>%
   mutate(bin_label  = str_replace(mpg_binned,
               "\\(([0-9.]+),([0-9.]+)\\]", "\\1 < mpg <= \\2")) %>%
   column_to_rownames('rn')

#                    mpg  mpg_binned cyl  disp  hp drat    wt  qsec vs am gear carb          bin_label
#Mazda RX4           21.0 (16.3,22.1]   6 160.0 110 3.90 2.620 16.46  0  1    4    4 16.3 < mpg <= 22.1
#Mazda RX4 Wag       21.0 (16.3,22.1]   6 160.0 110 3.90 2.875 17.02  0  1    4    4 16.3 < mpg <= 22.1
#Datsun 710          22.8   (22.1,28]   4 108.0  93 3.85 2.320 18.61  1  1    4    1   22.1 < mpg <= 28
#Hornet 4 Drive      21.4 (16.3,22.1]   6 258.0 110 3.08 3.215 19.44  1  0    3    1 16.3 < mpg <= 22.1
#Hornet Sportabout   18.7 (16.3,22.1]   8 360.0 175 3.15 3.440 17.02  0  0    3    2 16.3 < mpg <= 22.1
#Valiant             18.1 (16.3,22.1]   6 225.0 105 2.76 3.460 20.22  1  0    3    1 16.3 < mpg <= 22.1

Or another option is to extract the numeric part and paste

df %>%
    rownames_to_column('rn') %>%
    mutate(bin_label = map_chr(str_extract_all(mpg_binned, "[0-9.]+"),
              ~ str_c(first(.x), "< mpg <= ", last(.x)))) %>%
   column_to_rownames('rn')

Upvotes: 1

Related Questions