colin
colin

Reputation: 2666

chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:

present <- c(1:4)
pre.1   <- c(5:8)
pre.2   <- c(9:12)
present2 <- c(13:16)

id      <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait   <- rnorm(16,10,3)

d      <- data.frame(id,prev.id,trait)

print d:

   id prev.id     trait
1   1       5 10.693266
2   2       6 12.059654
3   3       7  3.594182
4   4       8 14.411477
5   5       9 10.840814
6   6      10 13.712924
7   7      11 11.258689
8   8      12 10.920899
9   9      NA 14.663039
10 10      NA  5.117289
11 11      NA  8.866973
12 12      NA 15.508879
13 13      NA 14.307738
14 14      NA 15.616640
15 15      NA 10.275843
16 16      NA 12.443139

Every observations has a unique value of id. However, some individuals have been observed in the past, and so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. However, some individuals have been remeasured multiple times. Observations 1-4 have previous IDs of 5-8, and observations of 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these were measured. Furthermore, observations 13-16 have never been measured before. So, observations 1:4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unqiue individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individuals trait. The final output would look like:

id           <- c(1:4,13:16)
prev.id      <- c(5:8, rep(NA,4))
trait        <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8 ,rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
  id prev.id     trait prev.trait.1 prev.trait.2
1  1       5 10.693266     10.84081    14.663039
2  2       6 12.059654     13.71292     5.117289
3  3       7  3.594182     11.25869     8.866973
4  4       8 14.411477     10.92090    15.508879
5 13      NA 14.307738           NA           NA
6 14      NA 15.616640           NA           NA
7 15      NA 10.275843           NA           NA
8 16      NA 12.443139           NA           NA

I can accomplish this in a straightforward manner, but it requires me coding an additional pairing for each previous observation, such that the number of code groups I need to write is the number of times any individual has been recorded. This is a pain, as in the data set I am applying this problem to, there may be anywhere from 0-100 previous observations of an individual.

#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)

#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)

#remove observations that are another individuals previous observation
d <- d[!(d$id %in% d$prev.id),]

How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?

Upvotes: 3

Views: 77

Answers (1)

A. Webb
A. Webb

Reputation: 26466

What you have is a forest of linear lists. We'll start at the terminal ends

roots<-d$id[is.na(d$prev.id)]

And determine the paths backwards

path <- function(node) {
  a <- integer(nrow(d))
  i <- 0
  while(!is.na(node)) {
    i <- i+1
    a[i] <- node
    node <- d$id[match(node,d$prev.id)]
  }
  return(rev(a[1:i]))
}

Then we can get a 'stacked' representation of your desired output with

x<-do.call(rbind,lapply(roots,
  function(r) {p<-path(r); data.frame(id=p[[1]],seq=seq_along(p),traits=d$trait[p])}))

And then use reshape2::dcast to get it in the desired shape

library(reshape2)
dcast(x,id~seq,fill=NA,value.var='traits')
  id         1        2         3
1  1 10.693266 10.84081 14.663039
2  2 12.059654 13.71292  5.117289
3  3  3.594182 11.25869  8.866973
4  4 14.411477 10.92090 15.508879
5 13 14.307738       NA        NA
6 14 15.616640       NA        NA
7 15 10.275843       NA        NA
8 16 12.443139       NA        NA

I leave it to you to adapt column names.

Upvotes: 2

Related Questions