group by sequence of events and get summary statistics for each sequence

Question

I have a data.frame containing a log of sequences of events. Here, sequence 1 is composed of event A, then B, then C, each starting at a specific timestamp (in seconds).

df=data.frame(id=runif(10, 1e6, 1e7), sequence = c(1,1,1,2,2,3,3,3,4,4), event=c("A", "B", "C", "B", "C", "A", "B", "C", "B", "C"), starts_at=c(20,22,24,20,30,20,21,23,20,40))

What I want is to group my data.frame by type of sequence (there are dozens of types, of length 2 to 6), such as A->B->C or B->C, and then get some results for each type. The desired output would be:

####                      sequence_type number.appearances mean.delay.between.events
####                    1           ABC                  2                   1.5 / 2
####                    2            BC                  2                        15

The last column, "mean delay", would be a string composed of the mean time difference between successive events in a sequence: in the ABC sequence, there are 1.5 seconds on average between A and B, and 2 between B and C. I also thought of "spreading" each mean difference into a new column diff.1, diff.2, ..., but that seems complicated since sequences have different lengths. I'm open to other ways of presenting this information, though.

So far I've come up with:

library(dplyr)
df %>% group_by(sequence) %>% arrange(starts_at) %>% summarise(sequence_type = paste0(event, collapse="")) %>% group_by(sequence_type) %>% tally

I didn't find how to achieve the second part. Thanks for the help...

Alejandro Andrade · Accepted Answer

This might not be as elegant as the solution you would get with dplyr, but I think it is general enough to work with your real data.
First, you need to get the sequence type corresponding to each row of your data; that is ayuda_seq:

library(zoo)
df=data.frame(id=runif(14, 1e6, 1e7), sequence = c(1,1,1,2,2,3,3,3,4,4,5,5,5,5), 
              event=c("A", "B", "C", "B", "C", "A", "B", "C", "B", "C","A","B","C","D"), 
              starts_at=c(20,22,24,20,30,20,21,23,20,40,20,22,21,15))
ayuda_seq = sapply(df$sequence, function(x) paste0(df[df$sequence == x,3],collapse = ""))
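To make the shape of ayuda_seq concrete, here is a minimal sketch on a hypothetical five-row data frame (toy and toy_seq are illustrative names, not part of the data above). It shows that the result holds one label per row, not one per sequence:

```r
# Hypothetical toy data: sequence 1 is A, B, C and sequence 2 is B, C.
toy <- data.frame(sequence = c(1, 1, 1, 2, 2),
                  event = c("A", "B", "C", "B", "C"),
                  stringsAsFactors = FALSE)

# Same idea as ayuda_seq: for every row, paste together all events of its sequence.
toy_seq <- sapply(toy$sequence,
                  function(x) paste0(toy$event[toy$sequence == x], collapse = ""))
print(toy_seq)
# "ABC" "ABC" "ABC" "BC" "BC"
```

Because each label is repeated once per event, the row counts per label have to be divided by the number of events in the label (nchar() here) to recover the number of sequences.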

Then you loop through the unique sequence types and, for each one, build the sub-sequences made of every 2 consecutive events.

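vec_means = NULL
for(x in unique(ayuda_seq)){
  # all rows that belong to sequences of type x
  data_temp = df[ayuda_seq == x,]
  # delays between consecutive rows (this also spans the boundary between two sequences)
  diff_temp = diff(data_temp$starts_at)
  # label each delay with its pair of consecutive events, e.g. "AB", "BC"
  temp_sub = apply(rollapply(data_temp[,3],FUN = paste0,width = 2),1,paste0,collapse = "")
  # mean delay per pair of events
  mean_temp = aggregate(diff_temp,by = list(temp_sub),mean)
  if(all(!duplicated(temp_sub))){
    averages = paste0(mean_temp[,2],collapse = " / ")
  } else{
    # keep only the pairs that are repeated, which drops a boundary pair such as "CA" that occurs once
    averages = paste0(mean_temp[match(temp_sub[duplicated(temp_sub)],mean_temp[,1]),2],collapse = " / ")
  }
  vec_means = c(vec_means,averages)
}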


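df_res = data.frame(sequence_type = unique(ayuda_seq),
                    # index the table by name: table() sorts its labels, unique() keeps order of appearance
                    number.appearances = as.numeric(table(ayuda_seq)[unique(ayuda_seq)])/nchar(unique(ayuda_seq)),
                    mean.delay.between.events = vec_means)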

The variable temp_sub holds the pairs of consecutive events found in the rows of the sequence type you are currently looping over. In the case of "ABC" there is also a possible pair "CA", where one sequence ends and the next one begins; it is not taken into consideration because it occurs only once.
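As a hedged alternative, here is a minimal base R sketch that takes diff() inside each sequence, so a pair that spans two sequences is never formed in the first place. It is not the method above, only one possible variation. It uses the 10-row data from the question without the id column, and it assumes that each event label is a single character, so that sequences with the same type string always have the same length. The names by_seq, seq_type, delays and summarise_type are illustrative.

```r
# Example data from the question (the id column is not needed here).
df <- data.frame(
  sequence  = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  event     = c("A", "B", "C", "B", "C", "A", "B", "C", "B", "C"),
  starts_at = c(20, 22, 24, 20, 30, 20, 21, 23, 20, 40),
  stringsAsFactors = FALSE
)

# Sort by sequence, then by time, so events are chronological within each sequence.
df <- df[order(df$sequence, df$starts_at), ]

# One data frame per sequence id.
by_seq <- split(df, df$sequence)

# Type label (e.g. "ABC") and within-sequence delays for every sequence.
seq_type <- vapply(by_seq, function(d) paste0(d$event, collapse = ""), character(1))
delays   <- lapply(by_seq, function(d) diff(d$starts_at))

# For one type, average the delays position by position across its sequences.
summarise_type <- function(type) {
  same_type  <- delays[seq_type == type]
  mean_delay <- colMeans(do.call(rbind, same_type))
  data.frame(
    sequence_type = type,
    number.appearances = length(same_type),
    mean.delay.between.events = paste(mean_delay, collapse = " / "),
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(unique(seq_type), summarise_type))
print(res)
# ABC: 2 appearances, mean delays "1.5 / 2"
# BC:  2 appearances, mean delay  "15"
```

Keeping the delays as a numeric matrix per type (the result of rbind) until the final paste() also makes it straightforward to return separate diff.1, diff.2, ... columns instead of a single string, if that presentation is preferred.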
