Reputation: 418
I have a dataset consisting of states experienced over spells of time (e.g. individual i was in a given state from time period 1 through 4, etc). The data are expressed as start and end period for each state. Individuals can experience an arbitrary number of spells. For analysis I need to restructure to an individual-time period matrix, i.e. one row per individual, one column per time period, each cell specifying the state experienced by that individual in that period, or NA if no state is recorded). I can do this via an ugly double loop but I am looking for a more efficient solution (in my full dataset, this is a bit slow - but mostly I am just wondering if there is a better/canonical way to do this, perhaps a tidyverse approach).
Sample data:
data <- rbind(data.frame(id = 1:100,
value = sample(c("a","b","c"),100, replace =T),
start = sample(1:5,100, replace = T)),
data.frame(id = sample(1:100, 50, replace = T),
value = sample(c("a","b","c"),50, replace =T),
start = sample(11:15,50, replace = T)))
data$end = data$start + sample(1:5, 150, replace = T)
Loop solution:
result <- matrix(NA, nrow=100, ncol=20)
for (i in 1:nrow(data)){
id <- data$id[i]
for (j in data$start[i]:data$end[i]){
result[id,j] <- data$value[i]
}
}
In the test dataset, half the individuals have 1 spell and half have 2 but as noted in the original data there can be an arbitrary number of spells per individual. Also for simplicity of generating test data all individuals experiencing two spells have the second spell in a later time window (after period 11) to avoid the complication of overlapping spells, but this 'two window' feature is not a property of the real data. However we can assume spells do not overlap.
Upvotes: 0
Views: 142
Reputation: 389155
Here is a tidyverse
approach :
start
and end
numbers.unnest
the data to get it in long format.id
and num
value.library(tidyverse)
data %>%
mutate(num = map2(start, end, seq)) %>%
unnest(num) %>%
select(-start, -end) %>%
distinct(id, num, .keep_all = TRUE) %>%
arrange(num) %>%
pivot_wider(names_from = num, values_from = value) %>%
arrange(id)
# id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 NA NA NA NA c c c NA NA NA
# 2 2 NA NA NA NA b b b NA NA NA
# 3 3 NA NA b b b b b NA NA NA
# 4 4 NA b b b b NA NA NA NA NA
# 5 5 NA c c c c c c NA NA NA
# 6 6 NA a a a NA NA NA NA NA NA
# 7 7 NA NA NA b b b b NA NA NA
# 8 8 NA b b b b NA NA NA NA NA
# 9 9 NA a a a a NA NA NA NA NA
#10 10 NA NA NA b b NA NA NA NA NA
# … with 90 more rows, and 10 more variables: `11` <chr>,
# `12` <chr>, `13` <chr>, `14` <chr>, `15` <chr>, `16` <chr>,
# `17` <chr>, `18` <chr>, `19` <chr>, `20` <chr>
Upvotes: 1