Reputation: 21
I don't know why this is not working. I have tried it various ways and it just does not work. It's not that I get an error using the if-statements I made myself, but it doesn't apply right.
Basically, there is a column Data$Age
and a column Data$Age2
.
If Data$Age
is value 50 - 100, I want Data$Age2
to say "50-100 Years" for that particular row.
Likewise, if Data$Age
is 25-50, I want Data$Age2
to say "25-50 Years" for the rows to which it applies.
What would the cleanest way to go about doing this in R?
Upvotes: 0
Views: 877
Reputation: 42544
So far, all the posted answers by Len Greski and InfiniteFlashChess in first place have suggested to use repeated subsetting statements or repeated calls to ifelse()
for each age range.
IMHO, this can't be considered as clean because it doesn't scale well with the number of age ranges. The only data-driven solution suggested by Onyambu in his comment is to use the cut()
function from base R.
Here, I suggest another data-driven solution which uses a lookup table with the lower and upper bounds of the age ranges and the associated labels and which updates in a non-equi join. This will allows us to specify an arbitrary number of ranges without any changes to the code:
library(data.table)
# define lookup table
lookup <- data.table(
lower = c(25L, 51L),
upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
lower upper label 1: 25 50 25-50 Years 2: 51 100 51-100 Years
# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))
# update in non-equi join
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label][]
Age Age2 1: 24 NA 2: 25 25-50 Years 3: 26 25-50 Years 4: 49 25-50 Years 5: 50 25-50 Years 6: 51 51-100 Years 7: 52 51-100 Years 8: 100 51-100 Years 9: 101 NA 10: 102 NA
Note that NA
indicate gaps in the age ranges defined in the lookup table.
InfiniteFlashChess has asked about benchmark results.
Any benchmarks will depend on the number of rows in Data
as well as on the number of groups, i.e., age ranges. So, we will do benchmark runs for 100 and 1 M rows as well as for 2 groups (as specified by the OP) and 8 groups.
The benchmark code for 2 groups:
library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))
lookup <- data.table(
lower = c(25L, 51L),
upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
microbenchmark::microbenchmark(
ifelse = {
copy(Data0) %>%
mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years",
ifelse(between(Age, 51, 100), "51 - 100 Years",
"")))
},
cut = {
copy(Data0) %>%
mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
},
baseR = {
data <- copy(Data0)
data$age2 <- ""
data$age2[data$Age %in% 51:100] <- "51 - 100 years"
data$age2[data$Age %in% 25:50] <- "25 - 50 years"
},
join_dt = {
Data <- copy(Data0)
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label]
},
times = 100L
)
Benchmark results for 100 rows:
Unit: microseconds expr min lq mean median uq max neval cld ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672 100 c cut 2272.280 2407.455 2716.67432 2537.3425 2827.135 7351.495 100 c baseR 57.016 83.446 94.80729 91.1865 106.667 164.248 100 a join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939 2803.159 100 b
Benchmark results for 1 M rows:
Unit: milliseconds expr min lq mean median uq max neval cld ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566 10 c cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061 10 b baseR 52.96059 59.36964 76.09431 62.19055 66.32506 198.73654 10 a join_dt 66.89256 67.61147 73.33428 72.55457 78.18675 81.69146 10 a
Benchmarking for 8 groups requires to write nested ifelse()
or repeated subset operations:
breaks <- seq(20, 100, 10)
lookup <- data.table(
lower = head(breaks, -1L),
upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]
microbenchmark::microbenchmark(
ifelse = {
copy(Data0) %>%
mutate(
Age2 = ifelse(
between(Age, 21, 30), "21 - 20 Years", ifelse(
between(Age, 31, 40), "31 - 40 Years", ifelse(
between(Age, 41, 50), "41 - 50 Years", ifelse(
between(Age, 51, 60), "51 - 60 Years", ifelse(
between(Age, 61, 70), "61 - 70 Years", ifelse(
between(Age, 71, 80), "71 - 80 Years", ifelse(
between(Age, 81, 90), "81 - 90 Years", ifelse(
between(Age, 91, 100), "91 - 100 Years", "")))))))))
},
cut = {
copy(Data0) %>%
mutate(Age2 = cut(Age, breaks))
},
subset = {
data <- copy(Data0)
data$age2 <- ""
data$age2[data$Age %in% 21:30] <- "21 - 30 years"
data$age2[data$Age %in% 31:40] <- "31 - 40 years"
data$age2[data$Age %in% 41:50] <- "41 - 50 years"
data$age2[data$Age %in% 51:60] <- "51 - 60 years"
data$age2[data$Age %in% 61:70] <- "61 - 70 years"
data$age2[data$Age %in% 71:80] <- "71 - 80 years"
data$age2[data$Age %in% 81:90] <- "81 - 90 years"
data$age2[data$Age %in% 91:100] <- "91 - 100 years"
},
join_dt = {
Data <- copy(Data0)
setDT(Data)[lookup, on =.(Age > lower, Age <= upper), Age2 := label]
},
times = 100L
)
Benchmark results for 100 rows:
Unit: microseconds expr min lq mean median uq max neval cld ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748 100 d cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627 100 c subset 174.820 199.741 219.6505 210.5015 231.4575 402.501 100 a join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500 100 b
Benchmark results for 1 M rows:
Unit: milliseconds expr min lq mean median uq max neval cld ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891 10 c cut 220.3553 221.53939 243.49476 222.82165 230.06289 406.57277 10 b subset 176.0096 177.92958 199.13398 184.26878 192.60274 323.90338 10 b join_dt 62.7471 64.26875 67.94099 65.07508 75.03169 75.38813 10 a
Upvotes: 1
Reputation: 1058
dplyr may have the cleanest solution to this
Using Len Greski's sample data below...
data <- data.frame(Age1 = round(runif(100)*100,0))
data%>%
mutate(Age2 = ifelse(between(Age1, 25, 50), "25 - 50 Years",
ifelse(between(Age1, 51, 100),"51 - 100 Years", "Less than 25 years old")))
Assuming you only want two values for the column. ifelse()
is not efficient for more than two matches, say 100, though. I'll have to think of an alternative approach in the event that its not.
EDIT: or as Len has suggested below this, in a comment.
data%>%
mutate(Age2 = cut(Age1,c(24,50,100),c("25-50 years","51-100 Years")))
Upvotes: 2
Reputation: 10845
Here is a solution using base R. Note that since age2
cannot simultaneously be 25 - 50
and 50 - 100
I made the categories mutually exclusive:
data <- data.frame(age = round(runif(100)*100,0),
age2 = rep(" ",100),stringsAsFactors=FALSE)
data$age2[data$age %in% 51:100] <- "51 - 100 years"
data$age2[data$age %in% 25:50] <- "25 - 50 years"
data[1:15,]
...and the output:
> data[1:15,]
age age2
1 0
2 45 25 - 50 years
3 58 51 - 100 years
4 59 51 - 100 years
5 84 51 - 100 years
6 79 51 - 100 years
7 5
8 78 51 - 100 years
9 46 25 - 50 years
10 6
11 73 51 - 100 years
12 37 25 - 50 years
13 5
14 41 25 - 50 years
15 58 51 - 100 years
>
Upvotes: 0