Identfy breaks in time series and assign unique factor for each break in R

Question

I've got a large df of vessel track time series data per hundreds of vessels. The time series is over multiple years, therefore each vessel has multiple tracks. Each 'track' is hourly data, and there are large gaps (> days) in the time series that I'm hoping to use to identify each individual track per vessel.

My plan is to use a loop to first select the vessel and its whole time series, then identify unique tracks per vessel, then split (into a list) the individual tracks per selected vessel, then do some math, the unsplit, and append to a new data frame of all vessels. I can't work out how to give a unique factor to each identified track for split(). Some simplified data is;

vessel<-c(rep("A",11))
time <- as.POSIXct(c("2017-01-01 00:02:25 GMT", "2017-01-01 01:31:26 GMT", "2017-01-01 02:37:42 GMT",
                     "2017-01-01 03:14:34 GMT", "2017-01-01 04:09:45 GMT", "2017-02-01 05:51:53 GMT",
                     "2017-03-01 06:22:24 GMT", "2017-03-01 07:34:44 GMT","2017-03-01 08:01:15 GMT",
                     "2017-03-01 09:16:44 GMT", "2017-03-01 10:48:12 GMT")) 

df<-data.frame(vessel,time)

You'll see that I've added a single time (row 6) that isn't part of a track -the data is riddled with these single pings that are not part of any track. I'd also like to know how to deal with those occurrences and delete them some how. My code that I've picked up in other posts looks like this so far;

df$gap <- c(0, with(df, time[-1] - time[-nrow(df)]))/60 # results in hours between rows
gap_threshold <- 10 # anything greater that 10 hours difference I treat as a different track
df$over_thresh <- df$gap < gap_threshold
df

This identifies where the breaks are, but how do I then assign a unique factor to each break that I can use in split(df, df$split_factor)? Ideally, final df should look something like, but I don't know how to create the column 'split_factor'?

vessel                time         gap       over_thresh  split_factor
1       A 2017-01-01 00:02:25   0.0000000        TRUE      split_1
2       A 2017-01-01 01:31:26   1.4836111        TRUE      split_1
3       A 2017-01-01 02:37:42   1.1044444        TRUE      split_1
4       A 2017-01-01 03:14:34   0.6144444        TRUE      split_1
5       A 2017-01-01 04:09:45   0.9197222        TRUE      split_1
6       A 2017-02-01 05:51:53 745.7022222       FALSE       delete
7       A 2017-03-01 06:22:24 672.5086111       FALSE      split_2
8       A 2017-03-01 07:34:44   1.2055556        TRUE      split_2
9       A 2017-03-01 08:01:15   0.4419444        TRUE      split_2
10      A 2017-03-01 09:16:44   1.2580556        TRUE      split_2
11      A 2017-03-01 10:48:12   1.5244444        TRUE      split_2
>

The second track starts at row 7 but its been identified as FALSE because of the difference in time from the previous row. However, it needs to be labeled as part of the next track.

And also, this is all being done with spatial data frames, so I assume this can be done but I could be wrong on that. I can extract data, and re-create the spatial data frame, no problem. Thanks.

akrun · Accepted Answer

Here, is one option with data.table. Create a grouping index with rleid based on the 'over_thresh', grouped by 'vessel', then create the 'split_factor' as a column with 'delete' string. Get the index (.I) of rows where there are any TRUE elements in 'over_thresh' grouped by 'vessel', 'grp', use that in i, get the group index (.GRP) and paste the substring split to assign the row elements in i for 'split_factor'

library(data.table)
setDT(df)[, grp := rleid(over_thresh|shift(over_thresh, type = 'lead')), vessel]
df[, split_factor := 'delete']
i1 <- df[, .I[any(over_thresh)], .(vessel, grp)]$V1
df[i1, split_factor := paste0('split_', .GRP), .(vessel, grp)][, grp := NULL][]
#     vessel                time         gap over_thresh split_factor
# 1:      A 2017-01-01 00:02:25   0.0000000        TRUE      split_1
# 2:      A 2017-01-01 01:31:26   1.4836111        TRUE      split_1
# 3:      A 2017-01-01 02:37:42   1.1044444        TRUE      split_1
# 4:      A 2017-01-01 03:14:34   0.6144444        TRUE      split_1
# 5:      A 2017-01-01 04:09:45   0.9197222        TRUE      split_1
# 6:      A 2017-02-01 05:51:53 745.7022222       FALSE       delete
# 7:      A 2017-03-01 06:22:24 672.5086111       FALSE      split_2
# 8:      A 2017-03-01 07:34:44   1.2055556        TRUE      split_2
# 9:      A 2017-03-01 08:01:15   0.4419444        TRUE      split_2
#10:      A 2017-03-01 09:16:44   1.2580556        TRUE      split_2
#11:      A 2017-03-01 10:48:12   1.5244444        TRUE      split_2

Or using dplyr and rle, after grouping by 'vessel', use the rle on the 'over_thresh' OR (|) the lead (i.e. the next value) of 'over_thresh' which return a list of lengths and values. Now, we make a manipulation of the 'values' (which is logical) by assigning the ones that are TRUE with 'split_', sequence of 'values', then change the ones that are FALSE to 'delete'

library(dplyr)
library(stringr)
df %>% 
     group_by(vessel) %>% 
     mutate(split_factor = inverse.rle(within.list(rle(over_thresh|
            lead(over_thresh)),
          values[values] <- str_c('split_', seq_along(values[values])))), 
          split_factor = replace(split_factor, 
             !as.logical(split_factor), 'delete'))
# A tibble: 11 x 5
# Groups:   vessel [1]
#   vessel time                    gap over_thresh split_factor
#                                    
# 1 A      2017-01-01 00:02:25   0     TRUE        split_1     
# 2 A      2017-01-01 01:31:26   1.48  TRUE        split_1     
# 3 A      2017-01-01 02:37:42   1.10  TRUE        split_1     
# 4 A      2017-01-01 03:14:34   0.614 TRUE        split_1     
# 5 A      2017-01-01 04:09:45   0.920 TRUE        split_1     
# 6 A      2017-02-01 05:51:53 746.    FALSE       delete      
# 7 A      2017-03-01 06:22:24 673.    FALSE       split_2     
# 8 A      2017-03-01 07:34:44   1.21  TRUE        split_2     
# 9 A      2017-03-01 08:01:15   0.442 TRUE        split_2     
#10 A      2017-03-01 09:16:44   1.26  TRUE        split_2     
#11 A      2017-03-01 10:48:12   1.52  TRUE        split_2

Identfy breaks in time series and assign unique factor for each break in R

Answers (1)

Related Questions