Marco C
Marco C

Reputation: 141

R: apply operations to vectors with different suffixes

I have to perform some simple operations upon few vectors and rows.

Assume that I have a database such as:

observation      outcome_1_a   outcome_2_a   outcome_1_b   outcome_2_b choice_a choice_b 
1                41            34            56            19          1        1 
2                32            78            43            6           2        1
3                39            19            18            55          1        2

For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen and the prefix _i, with i = a,b, indicates the number of times the situation is repeated.

If I want to create variables storing the highest outcome for each situation (a,b), hence:

max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)

Similarly, if I want to create variables storing the values chosen in each situation, I can do:

choice.a <- ifelse(data$choice_a == "1", data$outcome_1_a, data$outcome_1_b)
choice.b <- ifelse(data$choice_b == "1", data$outcome_2_a, data$outcome_2_b)

Finally, If I'd like to compute the mean by row of the situations a and b, I can do:

library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]

Now, all of these work just fine. However, I was wondering if such operations can be done in a more efficient way.

In the example there are only few situations, but, if in the future I'll have to deal with, let's say, 15 or more different situations (a,b,c,d,...,), writing such operations might be annoying.

Is there a way to automate such process based on the different prefixes and/or suffixes of the variables?

Thank you for your help

Upvotes: 1

Views: 65

Answers (2)

JdeMello
JdeMello

Reputation: 1718

For instance:

outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1          12          85          32          28          91          42          32          96          27          29        2        1        1        1        1
2          17          22          84          53          11          69          16          66          11          41        1        2        2        1        1
3          92          98          76          83          18          27          21          51          92          41        1        1        1        1        2
4          63          49          61          64         100          28          43          51          22          94        1        2        1        1        1

Define an index variable that will help you go through the loops:

seqmax <- seq(1, 10, by = 2)

seqmax is a 1 3 5 7 9. The reason being is that there are 5 letters "a" "b" "c" "d" "e". So this sequence will help you to pattern the loop. This can be automated for the max number of letters, just find the column index for the last column before choice_a. Then you can do seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2). The by = 2 argument can be adjusted for the number of columns by letter.

I use lapply with <<- to assing the new column to data.

lapply(c(1:5), function(x){

  data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)

  data[, paste0("choice.", letters[x])] <<- ifelse(
    data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = T)] == 1,
  data[, seqmax[x]], data[, seqmax[x] + 1])

  data[, paste0("mean.", letters[x])] <<- rowMeans(
    data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = T)])

})

Upvotes: 1

Eric Watt
Eric Watt

Reputation: 3240

You can select columns with some regex. For example, to get your max.a value.

library(data.table)

setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]

[1] 41 78 39

Alternatively, you could select your columns with some regex outside of the data.table. Lots of ways to go about this.

Similar application to your last command.

data[, 
     .(Mean = rowMeans(.SD)), 
     by = observation, 
     .SDcols = names(data) %like% "^outcome"]
   observation  Mean
1:           1 37.50
2:           2 39.75
3:           3 32.75

For choice.a, how would you choose between b, c, d, e etc?

Upvotes: 1

Related Questions