Reputation: 367
I have a data set in R containing a series of people, the events they are involved in, and the time (in seconds, starting from 0) at which each event occurs. It looks similar to this:
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
4 32.0 Joe
5 38.0 Joe
6 41.0 Joe
7 42.5 Joe
8 55.0 Anne
9 58.0 Anne
I need to filter the events by name, which means the event numbers will no longer be sequential for each person.
An example of what this looks like (notice how Bob is not involved in events 4-40, etc.):
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob
The events that are sequential and related are separated by an increment of 1. I would like to find the duration of each group of related events: for example, events 1-3 form a group lasting 28.5 seconds, and events 41-46 are another group lasting 17 seconds. This would be required for all the names listed in the person column.
I have tried filtering the names using dplyr, then taking the difference between event rows with as.matrix and finding where the increment is greater than 1 (indicating the current sequence of events has ended). I haven't found a way to assign the max and min from this to determine the duration of the related events. The solution does not need to involve this step, though; it was just the closest I could come.
The end goal is to plot the non-contiguous time durations for each person to have a visual representation of each person's event involvement for the entire data set.
Thank you in advance.
Upvotes: 2
Views: 1266
Reputation: 56905
Suppose first we have just Bob's rows of the dataframe, called bob. We will assume bob is already ordered by event, increasing.
Along the same lines as you mentioned (looking at diff(event) > 1), you can additionally use cumsum to group each event into the 'run' of events it belongs to:
library(plyr)
bob2 <- mutate(bob, start = c(1, diff(bob$event) > 1), run=cumsum(start))
event seconds person start run
1 1 0.0 Bob 1 1
2 2 15.0 Bob 0 1
3 3 28.5 Bob 0 1
4 41 256.0 Bob 1 2
5 42 261.0 Bob 0 2
6 43 266.0 Bob 0 2
7 44 268.5 Bob 0 2
8 45 272.0 Bob 0 2
9 46 273.0 Bob 0 2
10 49 569.0 Bob 1 3
11 80 570.5 Bob 1 4
12 81 581.0 Bob 0 4
start indicates whether the row starts a run of sequential events, and run is which such set of events we are in.
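To see the diff/cumsum trick in isolation, here is a minimal base-R illustration on a made-up event vector (not from your data):

```r
# Hypothetical event numbers with two gaps in the sequence
event <- c(1, 2, 3, 41, 42, 46)

# A run starts wherever the gap to the previous event exceeds 1;
# the first event always starts a run.
start <- c(1, diff(event) > 1)   # 1 0 0 1 0 1

# cumsum turns the start flags into run ids
run <- cumsum(start)             # 1 1 1 2 2 3
```

Every event in the same run shares an id, so any per-run summary (min, max, range of seconds) can be grouped on run.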
Then you can just find the duration:
ddply(bob2, .(run), summarize, length=diff(range(seconds)))
run length
1 1 28.5
2 2 17.0
3 3 0.0
4 4 10.5
Now supposing you have your original dataframe with everyone mixed together in it, we can use ddply again to split it up by person:
tmp <- ddply(df, .(person), transform, run=cumsum(c(1, diff(event) != 1)))
ddply(tmp, .(person, run), summarize, length=diff(range(seconds)), start_event=min(event), end_event=max(event))
person run length start_event end_event
1 Anne 1 3.0 8 9
2 Bob 1 28.5 1 3
3 Bob 2 17.0 41 46
4 Bob 3 0.0 49 49
5 Bob 4 10.5 80 81
6 Joe 1 10.5 4 7
Note: my df is your bob table rbind-ed to your other table, then unique()'d (just to show it works when there is more than one run per person).
There is probably a clever way to do this that combines the two ddply calls (or uses the dplyr pipe-y syntax that I am not familiar with), but I do not know what it is.
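For what it's worth, a dplyr version along the same lines might look like this. This is a sketch, not the answerer's code; it assumes dplyr is installed, and it rebuilds df from the question's first table so the chunk runs on its own:

```r
library(dplyr)

# Combined data, as in the question's first table
df <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
4 32.0 Joe
5 38.0 Joe
6 41.0 Joe
7 42.5 Joe
8 55.0 Anne
9 58.0 Anne", header = TRUE)

durations <- df %>%
  arrange(person, event) %>%
  group_by(person) %>%
  # same diff/cumsum trick, applied within each person
  mutate(run = cumsum(c(1, diff(event) != 1))) %>%
  group_by(person, run) %>%
  summarise(length = diff(range(seconds)),
            start_event = first(event),
            end_event = last(event))
```

The two grouping steps replace the two ddply calls: one group_by(person) to number the runs, one group_by(person, run) to summarise them.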
Upvotes: 1
Reputation: 132651
Use this:
DF <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob", header = TRUE)
DF$personEvent <- cumsum(c(1L, diff(DF$event)) != 1L)
# event seconds person personEvent
#1 1 0.0 Bob 0
#2 2 15.0 Bob 0
#3 3 28.5 Bob 0
#4 41 256.0 Bob 1
#5 42 261.0 Bob 1
#6 43 266.0 Bob 1
#7 44 268.5 Bob 1
#8 45 272.0 Bob 1
#9 46 273.0 Bob 1
#10 49 569.0 Bob 2
#11 80 570.5 Bob 3
#12 81 581.0 Bob 3
Since I'm not a follower of the great pipe, I leave the rest to you.
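One pipe-free way to finish from here, staying in base R (aggregate is my choice, not part of the answer above; DF is repeated so the chunk runs on its own):

```r
# DF with its personEvent column, as constructed above
DF <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob", header = TRUE)
DF$personEvent <- cumsum(c(1L, diff(DF$event)) != 1L)

# Span of seconds within each (person, personEvent) group
durations <- aggregate(seconds ~ person + personEvent, data = DF,
                       FUN = function(s) diff(range(s)))
names(durations)[names(durations) == "seconds"] <- "duration"
```

This yields one row per run: 28.5, 17, 0, and 10.5 seconds for Bob's four groups.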
Upvotes: 4