Reputation: 415
I need to identify series of events (at most 3 events per series) that occurred within 60 seconds of each other.
Here is the input data:
IN<-read.table(header = FALSE, text = "
2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
and here is the desired output:
OUT<-read.table(header = FALSE, text = "
2018-06-01_04:29:47 1
2018-06-01_05:44:41 1
2018-06-01_05:44:43 2
2018-06-01_05:44:45 3
2018-06-01_05:57:54 1
2018-06-01_05:57:56 2
2018-06-01_05:57:58 3
2018-06-01_08:10:35 1
2018-06-01_08:41:20 1
2018-06-01_08:41:22 2
2018-06-01_08:41:24 3
2018-06-01_08:52:01 1
2018-06-01_09:02:13 1
2018-06-01_09:22:45 1
",quote="\n",col.names=c("time","response"))
I have searched for similar questions, but without success. I guess that the function diff is the first step towards solving this problem,
response <- as.numeric(diff(IN$time) > 60)
but then I am not sure how to proceed from these flags to the desired output, in particular how to cap each series at 3 events.
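For what it's worth, this is as far as I can sketch the idea in base R (gap, grp and count are just scratch names; the maximum of 3 events per series is still not handled):
gap <- c(Inf, diff(as.numeric(IN$time)))            # seconds since the previous event
grp <- cumsum(gap > 60)                             # a new group starts whenever the gap exceeds 60 seconds
count <- ave(seq_along(grp), grp, FUN = seq_along)  # running position within each group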
Any help will be appreciated.
Upvotes: 0
Views: 973
Reputation: 5405
Here's a data frame with some edge cases:
IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:44:47
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_05:58:56
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
IN
time
1 2018-06-01 04:29:47
2 2018-06-01 05:44:41
3 2018-06-01 05:44:43
4 2018-06-01 05:44:45
5 2018-06-01 05:44:47
6 2018-06-01 05:57:54
7 2018-06-01 05:57:56
8 2018-06-01 05:57:58
9 2018-06-01 05:58:56
10 2018-06-01 08:10:35
11 2018-06-01 08:41:20
12 2018-06-01 08:41:22
13 2018-06-01 08:41:24
14 2018-06-01 08:52:01
15 2018-06-01 09:02:13
16 2018-06-01 09:22:45
You'll notice that row 9 is a minute after the mid-group time, but more than a minute after the group's reference (first) time. Row 5 is also the 4th member of what would be a single group if no limit were imposed.
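To make that concrete (just a quick check using the row numbers printed above):
difftime(IN$time[9], IN$time[8], units = "secs")  # 58 secs: within a minute of the previous event
difftime(IN$time[9], IN$time[6], units = "secs")  # 62 secs: more than a minute after the group's first event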
Here's my solution using dplyr. I think it works, generally speaking:
library(dplyr)

res <- IN %>%
  # seconds elapsed since the previous event (NA for the first row)
  mutate(diffs = as.numeric(time - lag(time)),
         # 1 marks the start of a run of events that are <= 60 s apart
         helper1 = case_when(is.na(diffs) ~ 1,
                             diffs <= 60 ~ 0,
                             TRUE ~ 1),
         grouper1 = cumsum(helper1)) %>%
  group_by(grouper1) %>%
  # cumulative seconds since each run's first event
  mutate(helper2 = cumsum(diffs) - first(diffs),
         helper3 = helper2 %/% 60,
         helper4 = helper1 + if_else(is.na(helper3), 0, helper3)) %>%
  ungroup() %>%
  mutate(grouper2 = cumsum(helper4)) %>%
  group_by(grouper2) %>%
  # split every group into chunks of at most 3 rows
  mutate(rn0 = row_number() - 1,
         grouper3 = rn0 %/% 3) %>%
  group_by(grouper2, grouper3) %>%
  mutate(count = row_number()) %>%
  ungroup() %>%
  select(time, count)
the result:
> res
# A tibble: 16 x 2
time count
<dttm> <int>
1 2018-06-01 04:29:47 1
2 2018-06-01 05:44:41 1
3 2018-06-01 05:44:43 2
4 2018-06-01 05:44:45 3
5 2018-06-01 05:44:47 1
6 2018-06-01 05:57:54 1
7 2018-06-01 05:57:56 2
8 2018-06-01 05:57:58 3
9 2018-06-01 05:58:56 1
10 2018-06-01 08:10:35 1
11 2018-06-01 08:41:20 1
12 2018-06-01 08:41:22 2
13 2018-06-01 08:41:24 3
14 2018-06-01 08:52:01 1
15 2018-06-01 09:02:13 1
16 2018-06-01 09:22:45 1
I think I structured the dplyr calls in a way that you can follow them, but if you have questions feel free to post them in the comments.
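If it helps to see what the helpers are doing, the first mutate step can be run on its own (this is just the opening chunk of the pipeline above, nothing new) and its columns inspected before the group_by steps kick in:
IN %>%
  mutate(diffs = as.numeric(time - lag(time)),
         helper1 = case_when(is.na(diffs) ~ 1,
                             diffs <= 60 ~ 0,
                             TRUE ~ 1),
         grouper1 = cumsum(helper1))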
Upvotes: 1
Reputation: 12074
Here's a solution using the dplyr, magrittr, and lubridate packages.
IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
I've removed the blank first line of the input data frame, as it caused problems. The following function filters the data frame to those elements within 60 seconds before the given ref_time and counts the number of rows using nrow.
library(dplyr); library(lubridate); library(magrittr)

# count the events that fall in the 60-second window ending at ref_time
event_count <- function(ref_time){
  IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow
}
Here, I apply the function in a row-wise fashion, record the counts, and sort according to time (probably unnecessary). The results are piped back into the input data frame using the compound assignment pipe from magrittr.
IN %<>%
  rowwise() %>%
  mutate(counts = event_count(time)) %>%
  arrange(time)
Finally, the results.
# A tibble: 14 x 2
# time counts
# <dttm> <int>
# 1 2018-06-01 04:29:47 1
# 2 2018-06-01 05:44:41 1
# 3 2018-06-01 05:44:43 2
# 4 2018-06-01 05:44:45 3
# 5 2018-06-01 05:57:54 1
# 6 2018-06-01 05:57:56 2
# 7 2018-06-01 05:57:58 3
# 8 2018-06-01 08:10:35 1
# 9 2018-06-01 08:41:20 1
# 10 2018-06-01 08:41:22 2
# 11 2018-06-01 08:41:24 3
# 12 2018-06-01 08:52:01 1
# 13 2018-06-01 09:02:13 1
# 14 2018-06-01 09:22:45 1
I think what @PoGibas is alluding to is that, for some reason, there are two entries with the time 2018-06-01 05:57:54 in the input data frame. I'm not sure where the second one comes from...
EDIT: It's the leading newline in the read.table text that messes it up.
EDIT²: This returns a maximum of 3...
# same as above, but capped at a maximum of 3 events per window
event_count <- function(ref_time){
  min(IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow, 3)
}
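Just as a check (nothing new, simply re-running the earlier rowwise call with the capped function; %>% is used here instead of %<>% so IN is only printed, not overwritten):
IN %>%
  rowwise() %>%
  mutate(counts = event_count(time)) %>%
  arrange(time)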
Upvotes: 2