Anouk van Rooijen
Anouk van Rooijen

Reputation: 21

Create new column variables based upon condition in R

I don't know why this is not working. I have tried it various ways and it just does not work. It's not that I get an error using the if-statements I made myself, but it doesn't apply right.

Basically, there is a column Data$Age and a column Data$Age2.

If Data$Age is value 50 - 100, I want Data$Age2 to say "50-100 Years" for that particular row.

Likewise, if Data$Age is 25-50, I want Data$Age2 to say "25-50 Years" for the rows to which it applies.

What would the cleanest way to go about doing this in R?

Upvotes: 0

Views: 877

Answers (3)

Uwe
Uwe

Reputation: 42544

So far, all the posted answers by Len Greski and InfiniteFlashChess in first place have suggested to use repeated subsetting statements or repeated calls to ifelse() for each age range.

IMHO, this can't be considered as clean because it doesn't scale well with the number of age ranges. The only data-driven solution suggested by Onyambu in his comment is to use the cut() function from base R.

Here, I suggest another data-driven solution which uses a lookup table with the lower and upper bounds of the age ranges and the associated labels and which updates in a non-equi join. This will allows us to specify an arbitrary number of ranges without any changes to the code:

library(data.table)
# define lookup table
lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
   lower upper        label
1:    25    50  25-50 Years
2:    51   100 51-100 Years
# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))

# update in non-equi join
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label][]
    Age         Age2
 1:  24           NA
 2:  25  25-50 Years
 3:  26  25-50 Years
 4:  49  25-50 Years
 5:  50  25-50 Years
 6:  51 51-100 Years
 7:  52 51-100 Years
 8: 100 51-100 Years
 9: 101           NA
10: 102           NA

Note that NA indicate gaps in the age ranges defined in the lookup table.

Benchmarking

InfiniteFlashChess has asked about benchmark results.

Any benchmarks will depend on the number of rows in Data as well as on the number of groups, i.e., age ranges. So, we will do benchmark runs for 100 and 1 M rows as well as for 2 groups (as specified by the OP) and 8 groups.

The benchmark code for 2 groups:

library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))

lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]

microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years", 
                           ifelse(between(Age, 51, 100), "51 - 100 Years", 
                                  "")))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
  },
  baseR = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 51:100] <- "51 - 100 years"
    data$age2[data$Age %in% 25:50] <- "25 - 50 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

Benchmark results for 100 rows:

Unit: microseconds
    expr      min       lq       mean    median       uq       max neval cld
  ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672   100   c
     cut 2272.280 2407.455 2716.67432 2537.3425 2827.135  7351.495   100   c
   baseR   57.016   83.446   94.80729   91.1865  106.667   164.248   100 a  
 join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939  2803.159   100  b

Benchmark results for 1 M rows:

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
  ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566    10   c
     cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061    10  b 
   baseR  52.96059  59.36964  76.09431  62.19055  66.32506 198.73654    10 a  
 join_dt  66.89256  67.61147  73.33428  72.55457  78.18675  81.69146    10 a

Benchmarking for 8 groups requires to write nested ifelse() or repeated subset operations:

breaks <- seq(20, 100, 10)

lookup <- data.table(
  lower = head(breaks, -1L),
  upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]


microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(
        Age2 = ifelse(
          between(Age, 21, 30), "21 - 20 Years", ifelse(
            between(Age, 31, 40), "31 - 40 Years", ifelse(
              between(Age, 41, 50), "41 - 50 Years", ifelse(
                between(Age, 51, 60), "51 - 60 Years", ifelse(
                  between(Age, 61, 70), "61 - 70 Years", ifelse(
                    between(Age, 71, 80), "71 - 80 Years", ifelse(
                      between(Age, 81, 90), "81 - 90 Years", ifelse(
                        between(Age, 91, 100), "91 - 100 Years", "")))))))))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, breaks))
  },
  subset = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 21:30] <- "21 - 30 years"
    data$age2[data$Age %in% 31:40] <- "31 - 40 years"
    data$age2[data$Age %in% 41:50] <- "41 - 50 years"
    data$age2[data$Age %in% 51:60] <- "51 - 60 years"
    data$age2[data$Age %in% 61:70] <- "61 - 70 years"
    data$age2[data$Age %in% 71:80] <- "71 - 80 years"
    data$age2[data$Age %in% 81:90] <- "81 - 90 years"
    data$age2[data$Age %in% 91:100] <- "91 - 100 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age > lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

Benchmark results for 100 rows:

Unit: microseconds
    expr      min       lq      mean    median        uq      max neval  cld
  ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748   100    d
     cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627   100   c 
  subset  174.820  199.741  219.6505  210.5015  231.4575  402.501   100 a   
 join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500   100  b

Benchmark results for 1 M rows:

Unit: milliseconds
    expr       min         lq       mean     median         uq        max neval cld
  ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891    10   c
     cut  220.3553  221.53939  243.49476  222.82165  230.06289  406.57277    10  b 
  subset  176.0096  177.92958  199.13398  184.26878  192.60274  323.90338    10  b 
 join_dt   62.7471   64.26875   67.94099   65.07508   75.03169   75.38813    10 a

Upvotes: 1

InfiniteFlash
InfiniteFlash

Reputation: 1058

dplyr may have the cleanest solution to this

Using Len Greski's sample data below...

data <- data.frame(Age1 = round(runif(100)*100,0))

data%>%
mutate(Age2 = ifelse(between(Age1, 25, 50), "25 - 50 Years", 
              ifelse(between(Age1, 51, 100),"51 - 100 Years", "Less than 25 years old")))

Assuming you only want two values for the column. ifelse() is not efficient for more than two matches, say 100, though. I'll have to think of an alternative approach in the event that its not.

EDIT: or as Len has suggested below this, in a comment.

data%>% 
mutate(Age2 = cut(Age1,c(24,50,100),c("25-50 years","51-100 Years")))

Upvotes: 2

Len Greski
Len Greski

Reputation: 10845

Here is a solution using base R. Note that since age2 cannot simultaneously be 25 - 50 and 50 - 100 I made the categories mutually exclusive:

data <- data.frame(age = round(runif(100)*100,0),
                   age2 = rep(" ",100),stringsAsFactors=FALSE)
data$age2[data$age %in% 51:100] <- "51 - 100 years"
data$age2[data$age %in% 25:50] <- "25 - 50 years"
data[1:15,]

...and the output:

> data[1:15,]
   age           age2
1    0               
2   45  25 - 50 years
3   58 51 - 100 years
4   59 51 - 100 years
5   84 51 - 100 years
6   79 51 - 100 years
7    5               
8   78 51 - 100 years
9   46  25 - 50 years
10   6               
11  73 51 - 100 years
12  37  25 - 50 years
13   5               
14  41  25 - 50 years
15  58 51 - 100 years
> 

Upvotes: 0

Related Questions