Reputation: 2282
I would like to identify the duration of an activity that start at t1 and end at t7. The starting point is t1 that records the occurrence of an activity at t1_1 , t1_2, t1_3 and so on. For example in the case of id 12 activity occurred at t1_2 and t1_3 (i would like to save this) t2_2 (as there is no activity before and after I am not intrested in this activity), t3_1 (same as t2_2), t3_3, t4_2, t5_2, t6_1, t6_2, t6_3 and t7_3. I would like to identify to the start and end all id's in which an activity occured, the duration and the most frequent one.
Input:
id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Output for id 12
Id Start/End Duration Frequency
12 t1_1, t1_3 2 1
12 t6_1, t6_3 3 1
One way to dot this is using the bioconductor library but are there any better solution?
Sample data
df1 <- structure(list(id = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L),
t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L,
1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L,
0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L,
1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L,
1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L,
0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L,
1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 2
Views: 159
Reputation: 887048
We convert to 'long' format with pivot_longer
, then create a grouping variable with rleid
(from data.table
) based on the occurrence of similar adjacent elements in 'value', filter
the rows where the 'value' is 1, grouped by 'id', 'grp', we keep only rows where the frequency count is greater than 1, summarise
by paste
ing (str_c
) the first
and last
elements of 'name' as well get the count (n()
) and arrange
if necessary
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
df1 %>%
pivot_longer(cols = -id) %>%
mutate(grp = rleid(value)) %>%
filter(as.logical(value)) %>%
group_by(id, grp) %>%
filter(n() > 1) %>%
summarise(Start_End = str_c(first(name), last(name), sep=", "),
Duration = n()) %>%
arrange(id, grp)
Upvotes: 1
Reputation: 12703
library('data.table')
df1 <- melt(setDT(df1), id.var = 'id')
df1[, c('time', 'subtime') := tstrsplit(as.character(variable), "_", fixed = TRUE)]
df2 <- df1[, rle(value), by = .(id, time)][lengths > 1 & values == 1, ]
df3 <- df1[df2, on = c('id', 'time')]
df3 <- df3[, .(`Start/End` = paste0(time, '_', c(min(subtime), max(subtime)), collapse = " - "),
Duration = unique(lengths)),
by = .(id, time)]
df3[, Frequency := .N, by = .(id, `Start/End`)]
df3[, time := NULL]
df3[order(id), ]
# id Start/End Duration Frequency
# 1: 10 t1_1 - t1_3 3 1
# 2: 10 t2_1 - t2_3 3 1
# 3: 10 t3_1 - t3_3 3 1
# 4: 10 t4_1 - t4_3 3 1
# 5: 10 t5_1 - t5_3 3 1
# 6: 10 t6_1 - t6_3 3 1
# 7: 10 t7_1 - t7_3 3 1
# 8: 12 t1_1 - t1_3 2 1
# 9: 12 t6_1 - t6_3 3 1
# 10: 123 t2_1 - t2_3 3 1
# 11: 123 t4_1 - t4_3 3 1
# 12: 123 t5_1 - t5_3 3 1
# 13: 123 t7_1 - t7_3 3 1
Upvotes: 1