Restructuring time-spell data efficiently in R

Question

I have a dataset consisting of states experienced over spells of time (e.g. individual i was in a given state from time period 1 through 4, etc). The data are expressed as start and end period for each state. Individuals can experience an arbitrary number of spells. For analysis I need to restructure to an individual-time period matrix, i.e. one row per individual, one column per time period, each cell specifying the state experienced by that individual in that period, or NA if no state is recorded). I can do this via an ugly double loop but I am looking for a more efficient solution (in my full dataset, this is a bit slow - but mostly I am just wondering if there is a better/canonical way to do this, perhaps a tidyverse approach).

Sample data:

data <- rbind(data.frame(id = 1:100,
                value = sample(c("a","b","c"),100, replace =T),
                start = sample(1:5,100, replace = T)),
              data.frame(id = sample(1:100, 50, replace = T),
                value = sample(c("a","b","c"),50, replace =T),
                start = sample(11:15,50, replace = T)))
data$end = data$start + sample(1:5, 150, replace = T)

Loop solution:

result <- matrix(NA, nrow=100, ncol=20)

for (i in 1:nrow(data)){
  id <- data$id[i]
  for (j in data$start[i]:data$end[i]){                             
    result[id,j] <- data$value[i]
  }
}

In the test dataset, half the individuals have 1 spell and half have 2 but as noted in the original data there can be an arbitrary number of spells per individual. Also for simplicity of generating test data all individuals experiencing two spells have the second spell in a later time window (after period 11) to avoid the complication of overlapping spells, but this 'two window' feature is not a property of the real data. However we can assume spells do not overlap.

Ronak Shah · Accepted Answer

Here is a tidyverse approach :

Create a sequence between start and end numbers.
unnest the data to get it in long format.
Keep only unique combinations of id and num value.
Cast the data in wide format.

library(tidyverse)

data %>%
  mutate(num = map2(start, end, seq)) %>%
  unnest(num) %>%
  select(-start, -end) %>%
  distinct(id, num, .keep_all = TRUE) %>%
  arrange(num) %>%
  pivot_wider(names_from = num, values_from = value) %>%
  arrange(id)


#      id `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`   `10` 
#             
# 1     1 NA    NA    NA    NA    c     c     c     NA    NA    NA   
# 2     2 NA    NA    NA    NA    b     b     b     NA    NA    NA   
# 3     3 NA    NA    b     b     b     b     b     NA    NA    NA   
# 4     4 NA    b     b     b     b     NA    NA    NA    NA    NA   
# 5     5 NA    c     c     c     c     c     c     NA    NA    NA   
# 6     6 NA    a     a     a     NA    NA    NA    NA    NA    NA   
# 7     7 NA    NA    NA    b     b     b     b     NA    NA    NA   
# 8     8 NA    b     b     b     b     NA    NA    NA    NA    NA   
# 9     9 NA    a     a     a     a     NA    NA    NA    NA    NA   
#10    10 NA    NA    NA    b     b     NA    NA    NA    NA    NA   
# … with 90 more rows, and 10 more variables: `11` ,
#   `12` , `13` , `14` , `15` , `16` ,
#   `17` , `18` , `19` , `20`

Restructuring time-spell data efficiently in R

Answers (1)

Related Questions