user2175481
user2175481

Reputation: 127

Updated but still not working - Identfy breaks in time series and assign unique factor for each break in R

I've asked this question previously (Identfy breaks in time series and assign unique factor for each break in R) and the solution suggested worked at the time, but I now see it does not work in all cases. I've been trialing variations of that solution but have had no luck. Here is the simplified question again:

I have a date-time series of vessel positions, with large gaps in the time series. Gaps represent breaks in the vessel track. I want to add a unique identifier to each track. Here is some real data;

time<-c("2019-01-23 00:33:58 GMT", "2019-01-23 12:10:27 GMT", "2019-01-23 13:49:29 GMT", 
        "2019-01-23 15:29:27 GMT", "2019-01-23 18:35:27 GMT", "2019-01-23 21:38:29 GMT",
        "2019-01-28 14:52:10 GMT", "2019-01-28 16:31:37 GMT", "2019-01-28 18:07:40 GMT",
        "2019-01-28 19:46:40 GMT", "2019-01-28 21:22:40 GMT", "2019-01-29 13:53:48 GMT",
        "2019-01-29 15:25:48 GMT", "2019-01-29 18:43:54 GMT", "2019-01-29 20:19:56 GMT",
        "2019-01-29 21:56:27 GMT", "2019-02-03 14:57:01 GMT", "2019-02-04 15:45:56 GMT",
        "2019-02-04 16:49:57 GMT", "2019-02-05 17:46:05 GMT", "2019-02-05 18:03:06 GMT",
        "2019-02-08 16:05:59 GMT", "2019-02-08 17:42:59 GMT", "2019-02-09 02:47:00 GMT")

I define a 'track' as consecutive points that have less than 12 hours difference between them, and anything greater than 12 hours is considered part of the next track. Using the following code finds the breaks and uniquely labels each track, as well as places 'delete' in the one-off points that can't be associated with a track.

library(dplyr)
library(stringr)
vessel<-rep(1, length(time))
df<-data.frame(vessel,time)
df$gap <- round(c(0, difftime(time[-1],time[-nrow(df)], units = "hours")),2)
df$within_thresh <- df$gap < 12 #12 hours difference

df %>% 
  mutate(split_factor = inverse.rle(within.list(rle(within_thresh),
         values[values] <- str_c('track_', seq_along(values[values])))), 
         split_factor = replace(split_factor, !as.logical(split_factor), 'delete'))

The issue is that its placing a 'delete' next to a time that is actually the start of the next track e.g. see line 7 should read 'track_2', or line 12 should read 'track_3' in the results here. A genuine point I want to 'delete' is the one-off position at line 17.

 vessel                    time    gap within_thresh split_factor
1       1 2019-01-23 00:33:58 GMT   0.00          TRUE      track_1
2       1 2019-01-23 12:10:27 GMT  11.61          TRUE      track_1
3       1 2019-01-23 13:49:29 GMT   1.65          TRUE      track_1
4       1 2019-01-23 15:29:27 GMT   1.67          TRUE      track_1
5       1 2019-01-23 18:35:27 GMT   3.10          TRUE      track_1
6       1 2019-01-23 21:38:29 GMT   3.05          TRUE      track_1
7       1 2019-01-28 14:52:10 GMT 113.23         FALSE       delete # actually track_2
8       1 2019-01-28 16:31:37 GMT   1.66          TRUE      track_2
9       1 2019-01-28 18:07:40 GMT   1.60          TRUE      track_2
10      1 2019-01-28 19:46:40 GMT   1.65          TRUE      track_2
11      1 2019-01-28 21:22:40 GMT   1.60          TRUE      track_2
12      1 2019-01-29 13:53:48 GMT  16.52         FALSE       delete # actually track_3
13      1 2019-01-29 15:25:48 GMT   1.53          TRUE      track_3
14      1 2019-01-29 18:43:54 GMT   3.30          TRUE      track_3
15      1 2019-01-29 20:19:56 GMT   1.60          TRUE      track_3
16      1 2019-01-29 21:56:27 GMT   1.61          TRUE      track_3
17      1 2019-02-03 14:57:01 GMT 113.01         FALSE       delete # correct, as is a one-off
18      1 2019-02-04 15:45:56 GMT  24.82         FALSE       delete # actually track_4
19      1 2019-02-04 16:49:57 GMT   1.07          TRUE      track_4
20      1 2019-02-05 17:46:05 GMT  24.94         FALSE       delete # actually track_5
21      1 2019-02-05 18:03:06 GMT   0.28          TRUE      track_5
22      1 2019-02-08 16:05:59 GMT  70.05         FALSE       delete # actually track_6
23      1 2019-02-08 17:42:59 GMT   1.62          TRUE      track_6
24      1 2019-02-09 02:47:00 GMT   9.07          TRUE      track_6

The original user suggestion included an | (or) statement (rle(within_thresh|lead(within_thresh) but this only identifies 2 tracks in this example.

Any suggestions welcome.

Upvotes: 0

Views: 34

Answers (2)

Michael Dewar
Michael Dewar

Reputation: 3258

How about:

with_id <- df %>% 
  mutate(track_id = if_else(gap > 12,1,0),
         track_id = cumsum(track_id)) %>% 
  group_by(track_id) %>% 
  filter(n()>1)

If you want the track_id to be consecutive and start at one, you can finish with:

with_id$track_id <- with_id %>% group_indices()

The basic idea is to mark 1 where ever we start a new track. Then cumsum will label the rows which are not the starting point.

Upvotes: 1

Ben Norris
Ben Norris

Reputation: 5747

You are getting this behavior because you are only checking the time period before each time and not after. Here is a work around with many more conditional checks, but returns the correct tracks that break at 12 hour intervals and identifies the deleted point.

df %>%
  mutate(before = round(as.numeric(difftime(time, lag(time), units = "hours")), 2),
         after = round(as.numeric(difftime(lead(time), time, units = "hours")), 2)) %>%
  replace(is.na(.), 0) %>%
  mutate(before = inverse.rle(within.list(rle(before < 12), 
                                          values[values] <- str_c('track_', 
                                                                  seq_along(values[values])))), 
         after = inverse.rle(within.list(rle(after < 12), 
                                         values[values] <- str_c('track_', 
                                                                 seq_along(values[values]))))) %>%
  mutate(split_factor = case_when(before == "FALSE" ~ if_else(after == "FALSE", "delete", after),
                                  TRUE ~ before)) %>%
  select(vessel, time, split_factor)

   vessel                time split_factor
1       1 2019-01-23 00:33:58      track_1
2       1 2019-01-23 12:10:27      track_1
3       1 2019-01-23 13:49:29      track_1
4       1 2019-01-23 15:29:27      track_1
5       1 2019-01-23 18:35:27      track_1
6       1 2019-01-23 21:38:29      track_1
7       1 2019-01-28 14:52:10      track_2
8       1 2019-01-28 16:31:37      track_2
9       1 2019-01-28 18:07:40      track_2
10      1 2019-01-28 19:46:40      track_2
11      1 2019-01-28 21:22:40      track_2
12      1 2019-01-29 13:53:48      track_3
13      1 2019-01-29 15:25:48      track_3
14      1 2019-01-29 18:43:54      track_3
15      1 2019-01-29 20:19:56      track_3
16      1 2019-01-29 21:56:27      track_3
17      1 2019-02-03 14:57:01       delete
18      1 2019-02-04 15:45:56      track_4
19      1 2019-02-04 16:49:57      track_4
20      1 2019-02-05 17:46:05      track_5
21      1 2019-02-05 18:03:06      track_5
22      1 2019-02-08 16:05:59      track_6
23      1 2019-02-08 17:42:59      track_6
24      1 2019-02-09 02:47:00      track_6

Upvotes: 1

Related Questions