Reputation: 367
I have a data set in R containing a series of people, the events they are involved in, and the time (in seconds, starting from 0) at which each event occurs. It looks similar to this:
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
4 32.0 Joe
5 38.0 Joe
6 41.0 Joe
7 42.5 Joe
8 55.0 Anne
9 58.0 Anne
I need to filter the events by name, which means the event numbers will no longer be sequential for each person.
An example of what this looks like (notice how Bob is not involved in events 4-40, etc.):
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob
The events that are sequential and related are separated by an increment of 1. I would like to find the duration of each group of related events: for example, events 1-3 form a group lasting 28.5 seconds, and events 41-46 are another group lasting 17 seconds. This would be required for all the names listed in the person column.
I have tried filtering the names using dplyr, then taking the difference between event rows with as.matrix and finding where the increment is greater than 1 (indicating the current sequence of events has ended). I haven't found a way to assign the max and min from this to determine the duration of the related events. The solution does not need to involve this step, though; it was just the closest I could come.
The end goal is to plot the non-contiguous time durations for each person to have a visual representation of each person's event involvement for the entire data set.
Thank you in advance.
Upvotes: 2
Views: 1266
Reputation: 56905
Suppose first we have just Bob's rows of the dataframe, called bob. We will assume bob is already ordered by event, increasing.
Along the same lines as you mentioned (looking at diff(event) > 1), you can additionally use cumsum to group each event into the 'run' of events it belongs to:
library(plyr)
bob2 <- mutate(bob, start = c(1, diff(bob$event) > 1), run=cumsum(start))
event seconds person start run
1 1 0.0 Bob 1 1
2 2 15.0 Bob 0 1
3 3 28.5 Bob 0 1
4 41 256.0 Bob 1 2
5 42 261.0 Bob 0 2
6 43 266.0 Bob 0 2
7 44 268.5 Bob 0 2
8 45 272.0 Bob 0 2
9 46 273.0 Bob 0 2
10 49 569.0 Bob 1 3
11 80 570.5 Bob 1 4
12 81 581.0 Bob 0 4
start indicates whether the row starts a run of sequential events, and run is which such set of events we are in.
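To see the diff/cumsum trick in isolation, here is a minimal base-R illustration on a made-up event vector (not from your data):

```r
# Hypothetical event numbers with two gaps in the sequence
event <- c(1, 2, 3, 41, 42, 46)

# A run starts wherever the gap to the previous event exceeds 1;
# the first event always starts a run.
start <- c(1, diff(event) > 1)   # 1 0 0 1 0 1

# cumsum turns the start flags into run ids
run <- cumsum(start)             # 1 1 1 2 2 3
```

Every event in the same run shares an id, so any per-run summary (min, max, range of seconds) can be grouped on run.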
Then you can just find the duration:
ddply(bob2, .(run), summarize, length=diff(range(seconds)))
run length
1 1 28.5
2 2 17.0
3 3 0.0
4 4 10.5
Now supposing you have your original dataframe with everyone mixed together in it, we can use ddply again to split it up by person:
tmp <- ddply(df, .(person), transform, run=cumsum(c(1, diff(event) != 1)))
ddply(tmp, .(person, run), summarize, length=diff(range(seconds)), start_event=min(event), end_event=max(event))
person run length start_event end_event
1 Anne 1 3.0 8 9
2 Bob 1 28.5 1 3
3 Bob 2 17.0 41 46
4 Bob 3 0.0 49 49
5 Bob 4 10.5 80 81
6 Joe 1 10.5 4 7
Note: my df is your bob table rbind-ed to your other table, then unique()'d (just to show it works when there is more than one run per person).
There is probably a clever way to do this that combines the two ddply calls (or uses the dplyr pipe-y syntax that I am not familiar with), but I do not know what it is.
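For what it's worth, a dplyr version along the same lines might look like this. This is a sketch, not the answerer's code; it assumes dplyr is installed, and it rebuilds df from the question's first table so the chunk runs on its own:

```r
library(dplyr)

# Combined data, as in the question's first table
df <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
4 32.0 Joe
5 38.0 Joe
6 41.0 Joe
7 42.5 Joe
8 55.0 Anne
9 58.0 Anne", header = TRUE)

durations <- df %>%
  arrange(person, event) %>%
  group_by(person) %>%
  # same diff/cumsum trick, applied within each person
  mutate(run = cumsum(c(1, diff(event) != 1))) %>%
  group_by(person, run) %>%
  summarise(length = diff(range(seconds)),
            start_event = first(event),
            end_event = last(event))
```

The two grouping steps replace the two ddply calls: one group_by(person) to number the runs, one group_by(person, run) to summarise them.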
Upvotes: 1
Reputation: 132651
Use this:
DF <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob", header = TRUE)
DF$personEvent <- cumsum(c(1L, diff(DF$event)) != 1L)
# event seconds person personEvent
#1 1 0.0 Bob 0
#2 2 15.0 Bob 0
#3 3 28.5 Bob 0
#4 41 256.0 Bob 1
#5 42 261.0 Bob 1
#6 43 266.0 Bob 1
#7 44 268.5 Bob 1
#8 45 272.0 Bob 1
#9 46 273.0 Bob 1
#10 49 569.0 Bob 2
#11 80 570.5 Bob 3
#12 81 581.0 Bob 3
Since I'm not a follower of the great pipe, I leave the rest to you.
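One pipe-free way to finish from here, staying in base R (aggregate is my choice, not part of the answer above; DF is repeated so the chunk runs on its own):

```r
# DF with its personEvent column, as constructed above
DF <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob", header = TRUE)
DF$personEvent <- cumsum(c(1L, diff(DF$event)) != 1L)

# Span of seconds within each (person, personEvent) group
durations <- aggregate(seconds ~ person + personEvent, data = DF,
                       FUN = function(s) diff(range(s)))
names(durations)[names(durations) == "seconds"] <- "duration"
```

This yields one row per run: 28.5, 17, 0, and 10.5 seconds for Bob's four groups.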
Upvotes: 4