Reputation: 141
I have to perform some simple operations upon few vectors and rows.
Assume that I have a database such as:
observation outcome_1_a outcome_2_a outcome_1_b outcome_2_b choice_a choice_b
1 41 34 56 19 1 1
2 32 78 43 6 2 1
3 39 19 18 55 1 2
For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen and the prefix _i, with i = a,b, indicates the number of times the situation is repeated.
If I want to create variables storing the highest outcome for each situation (a,b), hence:
max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)
Similarly, if I want to create variables storing the values chosen in each situation, I can do:
choice.a <- ifelse(data$choice_a == "1", data$outcome_1_a, data$outcome_1_b)
choice.b <- ifelse(data$choice_b == "1", data$outcome_2_a, data$outcome_2_b)
Finally, If I'd like to compute the mean by row of the situations a and b, I can do:
library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]
Now, all of these work just fine. However, I was wondering if such operations can be done in a more efficient way.
In the example there are only few situations, but, if in the future I'll have to deal with, let's say, 15 or more different situations (a,b,c,d,...,), writing such operations might be annoying.
Is there a way to automate such process based on the different prefixes and/or suffixes of the variables?
Thank you for your help
Upvotes: 1
Views: 65
Reputation: 1718
For instance:
outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 85 32 28 91 42 32 96 27 29 2 1 1 1 1
2 17 22 84 53 11 69 16 66 11 41 1 2 2 1 1
3 92 98 76 83 18 27 21 51 92 41 1 1 1 1 2
4 63 49 61 64 100 28 43 51 22 94 1 2 1 1 1
Define an index variable that will help you go through the loops:
seqmax <- seq(1, 10, by = 2)
seqmax
is a 1 3 5 7 9
. The reason being is that there are 5 letters "a" "b" "c" "d" "e"
. So this sequence will help you to pattern the loop. This can be automated for the max number of letters, just find the column index for the last column before choice_a
. Then you can do seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2)
. The by = 2
argument can be adjusted for the number of columns by letter.
I use lapply
with <<-
to assing the new column to data
.
lapply(c(1:5), function(x){
data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)
data[, paste0("choice.", letters[x])] <<- ifelse(
data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = T)] == 1,
data[, seqmax[x]], data[, seqmax[x] + 1])
data[, paste0("mean.", letters[x])] <<- rowMeans(
data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = T)])
})
Upvotes: 1
Reputation: 3240
You can select columns with some regex. For example, to get your max.a
value.
library(data.table)
setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]
[1] 41 78 39
Alternatively, you could select your columns with some regex outside of the data.table
. Lots of ways to go about this.
Similar application to your last command.
data[,
.(Mean = rowMeans(.SD)),
by = observation,
.SDcols = names(data) %like% "^outcome"]
observation Mean
1: 1 37.50
2: 2 39.75
3: 3 32.75
For choice.a
, how would you choose between b, c, d, e etc?
Upvotes: 1