Reputation: 1792
I've used cut()
to create bins for a column of my data. Replicatable example using mtcars
below:
library(tidyverse)
df <- mtcars
df$mpg_binned <- cut(x = df$mpg, breaks = 4)
df <- df %>% select(mpg, mpg_binned, everything())
head(df)
#> mpg mpg_binned cyl disp hp drat wt qsec vs am
#> Mazda RX4 21.0 (16.3,22.1] 6 160 110 3.90 2.620 16.46 0 1
#> Mazda RX4 Wag 21.0 (16.3,22.1] 6 160 110 3.90 2.875 17.02 0 1
#> Datsun 710 22.8 (22.1,28] 4 108 93 3.85 2.320 18.61 1 1
#> Hornet 4 Drive 21.4 (16.3,22.1] 6 258 110 3.08 3.215 19.44 1 0
#> Hornet Sportabout 18.7 (16.3,22.1] 8 360 175 3.15 3.440 17.02 0 0
#> Valiant 18.1 (16.3,22.1] 6 225 105 2.76 3.460 20.22 1 0
With the new column mpg_binned
(above), I'd like to create a new chr
column that serves as a label of what the bins are (for when I'm creating output tables for the end-users).
So my desired output would be something like
#> mpg mpg_binned bin_label
#> Mazda RX4 21.0 (16.3,22.1] 16.3 < mpg <= 22.1
#> Mazda RX4 Wag 21.0 (16.3,22.1] 16.3 < mpg <= 22.1
#> Datsun 710 22.8 (22.1,28] 22.1 < mpg <= 28
#> Hornet 4 Drive 21.4 (16.3,22.1] 16.3 < mpg <= 22.1
If there were only two bins, I'd use ifelse()
--with multiple bins, do I need to do a nested ifelse()
? Is there something simpler?
For some reason, I can't get the below line to work, for the case of one bin. And I'm looking to label all of the bins.
Tidyverse solutions would be great, but I'm open to all solutions.
df$bin_label <-
ifelse(df2$mpg_binned=="(16.3,22.1]", yes = "16.3 < mpg <= 22.1", no = df2$mpg_binned)
Upvotes: 0
Views: 43
Reputation: 388797
Since we don't know beforehand what are the exact breaks
data is going to get divided, one option would be to use regex to extract numbers. So for mpg
column, we can do
sub("\\((\\d+\\.?\\d?),(\\d+\\.?\\d?).*", "\\1 < mpg <= \\2", df$mpg_binned)
#[1] "16.3 < mpg <= 22.1" "16.3 < mpg <= 22.1" "22.1 < mpg <= 28"
# "16.3 < mpg <= 22.1" "16.3 < mpg <= 22.1" ......
This would avoid writing multiple ifelse
statements since it can grow as number of breaks
increase.
For the updated case where we want to perform some mathematical operation on the extracted numbers we might need to extract the number separately.
library(dplyr)
df %>%
mutate(first_part = sub("\\((\\d+\\.?\\d?).*", "\\1", mpg_binned),
second_part = as.numeric(sub(".*,(\\d+\\.?\\d?).*", "\\1", mpg_binned)) - 1) %>%
tidyr::unite(combined, first_part, second_part, sep = "< mpg <= ")
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_binned combined
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 (16.3,22.1] 16.3< mpg <= 21.1
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 (16.3,22.1] 16.3< mpg <= 21.1
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 (22.1,28] 22.1< mpg <= 27
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 (16.3,22.1] 16.3< mpg <= 21.1
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 (16.3,22.1] 16.3< mpg <= 21.1
#...
Upvotes: 1
Reputation: 886938
Here is one option with str_replace
where we capture the numeric part as a group and replace with the backreference (\\1
, \\2
) of the group interspersed with the strings we want to add
library(tidyverse)
df %>%
rownames_to_column('rn') %>%
mutate(bin_label = str_replace(mpg_binned,
"\\(([0-9.]+),([0-9.]+)\\]", "\\1 < mpg <= \\2")) %>%
column_to_rownames('rn')
# mpg mpg_binned cyl disp hp drat wt qsec vs am gear carb bin_label
#Mazda RX4 21.0 (16.3,22.1] 6 160.0 110 3.90 2.620 16.46 0 1 4 4 16.3 < mpg <= 22.1
#Mazda RX4 Wag 21.0 (16.3,22.1] 6 160.0 110 3.90 2.875 17.02 0 1 4 4 16.3 < mpg <= 22.1
#Datsun 710 22.8 (22.1,28] 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.1 < mpg <= 28
#Hornet 4 Drive 21.4 (16.3,22.1] 6 258.0 110 3.08 3.215 19.44 1 0 3 1 16.3 < mpg <= 22.1
#Hornet Sportabout 18.7 (16.3,22.1] 8 360.0 175 3.15 3.440 17.02 0 0 3 2 16.3 < mpg <= 22.1
#Valiant 18.1 (16.3,22.1] 6 225.0 105 2.76 3.460 20.22 1 0 3 1 16.3 < mpg <= 22.1
Or another option is to extract the numeric part and paste
df %>%
rownames_to_column('rn') %>%
mutate(bin_label = map_chr(str_extract_all(mpg_binned, "[0-9.]+"),
~ str_c(first(.x), "< mpg <= ", last(.x)))) %>%
column_to_rownames('rn')
Upvotes: 1