Reputation: 127
I've asked this question previously (Identfy breaks in time series and assign unique factor for each break in R) and the solution suggested worked at the time, but I now see it does not work in all cases. I've been trialing variations of that solution but have had no luck. Here is the simplified question again:
I have a date-time series of vessel positions, with large gaps in the time series. Gaps represent breaks in the vessel track. I want to add a unique identifier to each track. Here is some real data;
time<-c("2019-01-23 00:33:58 GMT", "2019-01-23 12:10:27 GMT", "2019-01-23 13:49:29 GMT",
"2019-01-23 15:29:27 GMT", "2019-01-23 18:35:27 GMT", "2019-01-23 21:38:29 GMT",
"2019-01-28 14:52:10 GMT", "2019-01-28 16:31:37 GMT", "2019-01-28 18:07:40 GMT",
"2019-01-28 19:46:40 GMT", "2019-01-28 21:22:40 GMT", "2019-01-29 13:53:48 GMT",
"2019-01-29 15:25:48 GMT", "2019-01-29 18:43:54 GMT", "2019-01-29 20:19:56 GMT",
"2019-01-29 21:56:27 GMT", "2019-02-03 14:57:01 GMT", "2019-02-04 15:45:56 GMT",
"2019-02-04 16:49:57 GMT", "2019-02-05 17:46:05 GMT", "2019-02-05 18:03:06 GMT",
"2019-02-08 16:05:59 GMT", "2019-02-08 17:42:59 GMT", "2019-02-09 02:47:00 GMT")
I define a 'track' as consecutive points that have less than 12 hours difference between them, and anything greater than 12 hours is considered part of the next track. Using the following code finds the breaks and uniquely labels each track, as well as places 'delete' in the one-off points that can't be associated with a track.
library(dplyr)
library(stringr)
vessel<-rep(1, length(time))
df<-data.frame(vessel,time)
df$gap <- round(c(0, difftime(time[-1],time[-nrow(df)], units = "hours")),2)
df$within_thresh <- df$gap < 12 #12 hours difference
df %>%
mutate(split_factor = inverse.rle(within.list(rle(within_thresh),
values[values] <- str_c('track_', seq_along(values[values])))),
split_factor = replace(split_factor, !as.logical(split_factor), 'delete'))
The issue is that its placing a 'delete' next to a time that is actually the start of the next track e.g. see line 7 should read 'track_2', or line 12 should read 'track_3' in the results here. A genuine point I want to 'delete' is the one-off position at line 17.
vessel time gap within_thresh split_factor
1 1 2019-01-23 00:33:58 GMT 0.00 TRUE track_1
2 1 2019-01-23 12:10:27 GMT 11.61 TRUE track_1
3 1 2019-01-23 13:49:29 GMT 1.65 TRUE track_1
4 1 2019-01-23 15:29:27 GMT 1.67 TRUE track_1
5 1 2019-01-23 18:35:27 GMT 3.10 TRUE track_1
6 1 2019-01-23 21:38:29 GMT 3.05 TRUE track_1
7 1 2019-01-28 14:52:10 GMT 113.23 FALSE delete # actually track_2
8 1 2019-01-28 16:31:37 GMT 1.66 TRUE track_2
9 1 2019-01-28 18:07:40 GMT 1.60 TRUE track_2
10 1 2019-01-28 19:46:40 GMT 1.65 TRUE track_2
11 1 2019-01-28 21:22:40 GMT 1.60 TRUE track_2
12 1 2019-01-29 13:53:48 GMT 16.52 FALSE delete # actually track_3
13 1 2019-01-29 15:25:48 GMT 1.53 TRUE track_3
14 1 2019-01-29 18:43:54 GMT 3.30 TRUE track_3
15 1 2019-01-29 20:19:56 GMT 1.60 TRUE track_3
16 1 2019-01-29 21:56:27 GMT 1.61 TRUE track_3
17 1 2019-02-03 14:57:01 GMT 113.01 FALSE delete # correct, as is a one-off
18 1 2019-02-04 15:45:56 GMT 24.82 FALSE delete # actually track_4
19 1 2019-02-04 16:49:57 GMT 1.07 TRUE track_4
20 1 2019-02-05 17:46:05 GMT 24.94 FALSE delete # actually track_5
21 1 2019-02-05 18:03:06 GMT 0.28 TRUE track_5
22 1 2019-02-08 16:05:59 GMT 70.05 FALSE delete # actually track_6
23 1 2019-02-08 17:42:59 GMT 1.62 TRUE track_6
24 1 2019-02-09 02:47:00 GMT 9.07 TRUE track_6
The original user suggestion included an | (or) statement (rle(within_thresh|lead(within_thresh)
but this only identifies 2 tracks in this example.
Any suggestions welcome.
Upvotes: 0
Views: 34
Reputation: 3258
How about:
with_id <- df %>%
mutate(track_id = if_else(gap > 12,1,0),
track_id = cumsum(track_id)) %>%
group_by(track_id) %>%
filter(n()>1)
If you want the track_id to be consecutive and start at one, you can finish with:
with_id$track_id <- with_id %>% group_indices()
The basic idea is to mark 1 where ever we start a new track. Then cumsum
will label the rows which are not the starting point.
Upvotes: 1
Reputation: 5747
You are getting this behavior because you are only checking the time period before each time and not after. Here is a work around with many more conditional checks, but returns the correct tracks that break at 12 hour intervals and identifies the deleted point.
df %>%
mutate(before = round(as.numeric(difftime(time, lag(time), units = "hours")), 2),
after = round(as.numeric(difftime(lead(time), time, units = "hours")), 2)) %>%
replace(is.na(.), 0) %>%
mutate(before = inverse.rle(within.list(rle(before < 12),
values[values] <- str_c('track_',
seq_along(values[values])))),
after = inverse.rle(within.list(rle(after < 12),
values[values] <- str_c('track_',
seq_along(values[values]))))) %>%
mutate(split_factor = case_when(before == "FALSE" ~ if_else(after == "FALSE", "delete", after),
TRUE ~ before)) %>%
select(vessel, time, split_factor)
vessel time split_factor
1 1 2019-01-23 00:33:58 track_1
2 1 2019-01-23 12:10:27 track_1
3 1 2019-01-23 13:49:29 track_1
4 1 2019-01-23 15:29:27 track_1
5 1 2019-01-23 18:35:27 track_1
6 1 2019-01-23 21:38:29 track_1
7 1 2019-01-28 14:52:10 track_2
8 1 2019-01-28 16:31:37 track_2
9 1 2019-01-28 18:07:40 track_2
10 1 2019-01-28 19:46:40 track_2
11 1 2019-01-28 21:22:40 track_2
12 1 2019-01-29 13:53:48 track_3
13 1 2019-01-29 15:25:48 track_3
14 1 2019-01-29 18:43:54 track_3
15 1 2019-01-29 20:19:56 track_3
16 1 2019-01-29 21:56:27 track_3
17 1 2019-02-03 14:57:01 delete
18 1 2019-02-04 15:45:56 track_4
19 1 2019-02-04 16:49:57 track_4
20 1 2019-02-05 17:46:05 track_5
21 1 2019-02-05 18:03:06 track_5
22 1 2019-02-08 16:05:59 track_6
23 1 2019-02-08 17:42:59 track_6
24 1 2019-02-09 02:47:00 track_6
Upvotes: 1