dmitriy873
dmitriy873

Reputation: 121

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
   id event time
1   1     a    1
2   1     b    3
3   1     b    6
4   1     b   12
5   1     a   24
6   1     b   30
7   1     a   42
8   2     a    1
9   2     a    2
10  2     b    6
11  2     a   17
12  2     a   24

I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b". Some rules for creating sequences:

  1. Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
  2. To be counted, sequences must have different first event. For example, if seq = rows 8, 10, and 11 was counted, then seq = rows 8, 10, and 12 must not be counted.
  3. The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.

The expected result:

df1
  id count
1  1     2
2  2     2

There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.

Is it a way to solve the problem using "dplyr"?

Upvotes: 2

Views: 1729

Answers (2)

kdpsingh
kdpsingh

Reputation: 76

I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.

library(dplyr)

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

df %>% 
  full_join(df,by='id',suffix=c('1','2')) %>% 
  full_join(df,by='id') %>% 
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>% 
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>% 
  filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>% 
  filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
  group_by(id,time1) %>%
  slice(1) %>%   # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>% 
  count()

Here's the output:

# A tibble: 2 x 2
     id     n
  <dbl> <int>
1     1     2
2     2     2

Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:

Source: local data frame [4 x 7]
Groups: id, time1 [4]

     id event1 time1 event2 time2 event3 time3
  <dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1     1      a     1      b     6      a    24
2     1      a    24      b    30      a    42
3     2      a     1      b     6      a    24
4     2      a     2      b     6      a    24

EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes it is possible to generalize this to arbitrary length sequences but requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function on a list of objects as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:

library(dplyr)
library(foreach)

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df} 
df2 = Reduce(multi_full_join,df_list)

names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)

df2 = df2 %>% 
  mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>% 
  filter(seq_string == paste0(seq,collapse=''))

time_diff = df2 %>% select(grep('time',names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>% 
  unlist %>%  matrix(ncol=2,byrow=TRUE) %>% 
  as.data.frame

foreach(i=seq_along(time_diff),.combine=data.frame) %do%
{
  time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
} %>% 
  Reduce(`&`,.) %>% 
  which %>% 
  slice(df2,.) %>% 
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id,time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)

This outputs the following:

Source: local data frame [4 x 8]
Groups: id, time1 [4]

     id event1 time1 event2 time2 event3 time3 seq_string
  <dbl>  <chr> <dbl>  <chr> <dbl>  <chr> <dbl>      <chr>
1     1      a     1      b     6      a    24        aba
2     1      a    24      b    30      a    42        aba
3     2      a     1      b     6      a    24        aba
4     2      a     2      b     6      a    24        aba

If you want just the counts, you can group_by(id) then count() as in the original code snippet.

Upvotes: 3

sirallen
sirallen

Reputation: 1966

Perhaps it's easier to represent event sequences as strings and use regex:

df.str = lapply(split(df, df$id), function(d) {
    z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })

df.str = lapply(df.str, paste, collapse='')

# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"


df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=T)[[1]]))

> data.frame(id=names(df1), count=unlist(df1))
#   id count
# 1  1     2
# 2  2     2

Upvotes: 2

Related Questions