stefano
stefano

Reputation: 415

Identify events within a time window in R

I need to identify a series (maximum 3 events) of events that occurred within 60 seconds.

Here there is the IN data

IN<-read.table(header = FALSE, text = "
2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

and here there is the desired output

OUT<-read.table(header = FALSE, text = "
2018-06-01_04:29:47        1
2018-06-01_05:44:41        1
2018-06-01_05:44:43        2
2018-06-01_05:44:45        3
2018-06-01_05:57:54        1
2018-06-01_05:57:56        2
2018-06-01_05:57:58        3
2018-06-01_08:10:35        1
2018-06-01_08:41:20        1
2018-06-01_08:41:22        2
2018-06-01_08:41:24        3
2018-06-01_08:52:01        1
2018-06-01_09:02:13        1
2018-06-01_09:22:45        1
",quote="\n",col.names=c("time","response"))

I have searched for similar questions, but unsuccessfully. I guess that function diff is the first step for solving this problem,

response<-as.numeric(diff(IN$time)>60)

but than I have no idea how to proceed to get the desired output.

Any helps will be appreciated.

Upvotes: 0

Views: 973

Answers (2)

zack
zack

Reputation: 5405

Here's a data frame with some edge cases:

IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
           2018-06-01_05:44:41
           2018-06-01_05:44:43
           2018-06-01_05:44:45
           2018-06-01_05:44:47
           2018-06-01_05:57:54
           2018-06-01_05:57:56
           2018-06-01_05:57:58
           2018-06-01_05:58:56
           2018-06-01_08:10:35
           2018-06-01_08:41:20
           2018-06-01_08:41:22
           2018-06-01_08:41:24
           2018-06-01_08:52:01
           2018-06-01_09:02:13
           2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

IN
                  time
1  2018-06-01 04:29:47
2  2018-06-01 05:44:41
3  2018-06-01 05:44:43
4  2018-06-01 05:44:45
5  2018-06-01 05:44:47
6  2018-06-01 05:57:54
7  2018-06-01 05:57:56
8  2018-06-01 05:57:58
9  2018-06-01 05:58:56
10 2018-06-01 08:10:35
11 2018-06-01 08:41:20
12 2018-06-01 08:41:22
13 2018-06-01 08:41:24
14 2018-06-01 08:52:01
15 2018-06-01 09:02:13
16 2018-06-01 09:22:45

You'll notice line 9 is a minute after the mid-group time but not the reference time. Line 5 is also the 4th member of what would be a group if there were no limits imposed.

Here's my solution using dplyr. I think it works generally speaking:

res <- IN %>% mutate(diffs = as.numeric(time - lag(time)),
                     helper1 = case_when(is.na(diffs) ~ 1,
                                         diffs <= 60 ~ 0 ,
                                         TRUE ~ 1),
                     grouper1 = cumsum(helper1)) %>%
  group_by(grouper1) %>%
  mutate(helper2 = cumsum(diffs) - first(diffs),
         helper3 = helper2 %/% 60,
         helper4 = helper1 + if_else(is.na(helper3), 0, helper3)) %>%
  ungroup() %>%
  mutate(grouper2 = cumsum(helper4)) %>%
  group_by(grouper2) %>%
  mutate(rn0 = row_number() - 1,
         grouper3 = rn0 %/% 3) %>%
  group_by(grouper2, grouper3) %>%
  mutate(count = row_number()) %>%
  ungroup() %>%
  select(time, count)

the result:

> res
# A tibble: 16 x 2
   time                count
   <dttm>              <int>
 1 2018-06-01 04:29:47     1
 2 2018-06-01 05:44:41     1
 3 2018-06-01 05:44:43     2
 4 2018-06-01 05:44:45     3
 5 2018-06-01 05:44:47     1
 6 2018-06-01 05:57:54     1
 7 2018-06-01 05:57:56     2
 8 2018-06-01 05:57:58     3
 9 2018-06-01 05:58:56     1
10 2018-06-01 08:10:35     1
11 2018-06-01 08:41:20     1
12 2018-06-01 08:41:22     2
13 2018-06-01 08:41:24     3
14 2018-06-01 08:52:01     1
15 2018-06-01 09:02:13     1
16 2018-06-01 09:22:45     1

I think i structured the dplyr calls in a way where you can follow them, but if you have questions feel free to post in comments.

Upvotes: 1

Dan
Dan

Reputation: 12074

Here's a solution using dplyr, magrittr, and lubridate packages.

IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
               2018-06-01_05:44:41
               2018-06-01_05:44:43
               2018-06-01_05:44:45
               2018-06-01_05:57:54
               2018-06-01_05:57:56
               2018-06-01_05:57:58
               2018-06-01_08:10:35
               2018-06-01_08:41:20
               2018-06-01_08:41:22
               2018-06-01_08:41:24
               2018-06-01_08:52:01
               2018-06-01_09:02:13
               2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

I've removed the blank first line of the input data frame, as it caused problems. The following function filters the data frame to those elements within 60 seconds before the given ref_time and counts the number of rows using nrow.

event_count <- function(ref_time){
  IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow
}

Here, I apply the function in a row-wise fashion, record the counts, and sort according to time. (Probably unnecessary...) The results are piped back in to the input data frame using the compound assignment pipe from magrittr.

IN %<>% 
  rowwise() %>% 
  mutate(counts = event_count(time)) %>% 
  arrange(time)

Finally, the results.

# A tibble: 14 x 2
#    time                counts
#    <dttm>               <int>
# 1  2018-06-01 04:29:47      1
# 2  2018-06-01 05:44:41      1
# 3  2018-06-01 05:44:43      2
# 4  2018-06-01 05:44:45      3
# 5  2018-06-01 05:57:54      1
# 6  2018-06-01 05:57:56      2
# 7  2018-06-01 05:57:58      3
# 8  2018-06-01 08:10:35      1
# 9  2018-06-01 08:41:20      1
# 10 2018-06-01 08:41:22      2
# 11 2018-06-01 08:41:24      3
# 12 2018-06-01 08:52:01      1
# 13 2018-06-01 09:02:13      1
# 14 2018-06-01 09:22:45      1

I think what @PoGibas is alluding to is for some reason there are two entries with the time 2018-06-01 05:57:54 in the input data frame. I'm not sure where the second comes from...


EDIT: It's the new line in the read table that messes it up.

EDIT²: This returns a maximum of 3...

event_count <- function(ref_time){
  min(IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow, 3)
}

Upvotes: 2

Related Questions