aelhak
aelhak

Reputation: 415

generate variable based on first occurrence of a value

I have 5 repeat measures called pub1:pub5 each taking a value of 1 to 4. Each was measured at a different age age1:age5. That is, pub1 was measured at age1....pub5 at age5 etc.

I would like to create a new variable age_pb2 that shows the age at which a value of 2 first occurred in pub. For example, for individual x, age_pb2 will equal age3 if the first time a value of 2 is scored is in pub3

I have tried modifying previous code but not had much luck.

library(tidyverse)
#Example data
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
               age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
               pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2))

 data <- data %>% mutate_at(vars(starts_with("pub")), funs(round(replace(., .< 0, NA), 0)))

#New variable showing first age at getting a score of 2 (doesn't work)

i1 <- grepl('^pub', names(data)) # index for pub columns
i2 <- grepl('^age', names(data)) # index for age columns

data[paste0("age_pb2")] <- lapply(2, function(i) {
j1 <- max.col(data[i1] == i, 'first')
j2 <- rowSums(data[i1] == i) == 0
data[i2][cbind(seq_len(nrow(data)), j1 *(NA^j2))]
})

Upvotes: 0

Views: 561

Answers (1)

Nic
Nic

Reputation: 383

set.seed(1)
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
                   age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
                   pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2)) %>% 
  mutate_at(vars(starts_with("pub")), funs(round(replace(., .< 0, NA), 0))) %>%
  mutate(age_pb2 = eval(parse(text = paste0("age", which.min(apply(select(., starts_with("pub")), 2, function(x) which(x == 2)[1]))))))

The way it works, you apply over the pubs columns and take with which(x == 2)[1] the first matched row per column, then take the which.min to get the column index number (of pub respectively age) which you then paste with "age" to assign (using eval(parse(text = variable name))) the respective column.

E.g. here after apply you get

[pub1 = 2,  pub2 = 1, pub3 = 2, pub4 = 4, pub5 = 2]

which is the first occurrence of 2 per column. The earliest (which.min) occurrence is for the second pub column, thus index is 2. This pasted with "age" and eval parsed to mutate.

EDIT

It is probably more convenient to do it in a for loop for all age_pbi, or there is an easy solution in dplyr that I am not aware of.

for (i in 1:5) {
  index <- which.min(apply(select(data, starts_with("pub")), 2, function(x) which(x == i)[1]))
  data[ ,paste0("age_pb", i)] <- data[ ,paste0("age", index)]
}

Note however, that which.min takes the first minimum. E.g. pub1 and pub2 both have a 1 in the first row, so the above approach assigns age1 to age_pb1 whereas it could be age2 as well. I don't know what you want to do with this, so can't say what is a better option.

Upvotes: 1

Related Questions