Reputation: 127
I've got a large df of vessel track time series data per hundreds of vessels. The time series is over multiple years, therefore each vessel has multiple tracks. Each 'track' is hourly data, and there are large gaps (> days) in the time series that I'm hoping to use to identify each individual track per vessel.
My plan is to use a loop to first select the vessel and its whole time series, then identify unique tracks per vessel, then split (into a list) the individual tracks per selected vessel, then do some math, the unsplit, and append to a new data frame of all vessels. I can't work out how to give a unique factor to each identified track for split(). Some simplified data is;
vessel<-c(rep("A",11))
time <- as.POSIXct(c("2017-01-01 00:02:25 GMT", "2017-01-01 01:31:26 GMT", "2017-01-01 02:37:42 GMT",
"2017-01-01 03:14:34 GMT", "2017-01-01 04:09:45 GMT", "2017-02-01 05:51:53 GMT",
"2017-03-01 06:22:24 GMT", "2017-03-01 07:34:44 GMT","2017-03-01 08:01:15 GMT",
"2017-03-01 09:16:44 GMT", "2017-03-01 10:48:12 GMT"))
df<-data.frame(vessel,time)
You'll see that I've added a single time (row 6) that isn't part of a track -the data is riddled with these single pings that are not part of any track. I'd also like to know how to deal with those occurrences and delete them some how. My code that I've picked up in other posts looks like this so far;
df$gap <- c(0, with(df, time[-1] - time[-nrow(df)]))/60 # results in hours between rows
gap_threshold <- 10 # anything greater that 10 hours difference I treat as a different track
df$over_thresh <- df$gap < gap_threshold
df
This identifies where the breaks are, but how do I then assign a unique factor to each break that I can use in split(df, df$split_factor)
? Ideally, final df should look something like, but I don't know how to create the column 'split_factor'?
vessel time gap over_thresh split_factor
1 A 2017-01-01 00:02:25 0.0000000 TRUE split_1
2 A 2017-01-01 01:31:26 1.4836111 TRUE split_1
3 A 2017-01-01 02:37:42 1.1044444 TRUE split_1
4 A 2017-01-01 03:14:34 0.6144444 TRUE split_1
5 A 2017-01-01 04:09:45 0.9197222 TRUE split_1
6 A 2017-02-01 05:51:53 745.7022222 FALSE delete
7 A 2017-03-01 06:22:24 672.5086111 FALSE split_2
8 A 2017-03-01 07:34:44 1.2055556 TRUE split_2
9 A 2017-03-01 08:01:15 0.4419444 TRUE split_2
10 A 2017-03-01 09:16:44 1.2580556 TRUE split_2
11 A 2017-03-01 10:48:12 1.5244444 TRUE split_2
>
The second track starts at row 7 but its been identified as FALSE because of the difference in time from the previous row. However, it needs to be labeled as part of the next track.
And also, this is all being done with spatial data frames, so I assume this can be done but I could be wrong on that. I can extract data, and re-create the spatial data frame, no problem. Thanks.
Upvotes: 1
Views: 165
Reputation: 887223
Here, is one option with data.table
. Create a grouping index with rleid
based on the 'over_thresh', grouped by 'vessel', then create the 'split_factor' as a column with 'delete' string. Get the index (.I
) of rows where there are any
TRUE
elements in 'over_thresh' grouped by 'vessel', 'grp', use that in i
, get the group index (.GRP
) and paste the substring split
to assign the row elements in i
for 'split_factor'
library(data.table)
setDT(df)[, grp := rleid(over_thresh|shift(over_thresh, type = 'lead')), vessel]
df[, split_factor := 'delete']
i1 <- df[, .I[any(over_thresh)], .(vessel, grp)]$V1
df[i1, split_factor := paste0('split_', .GRP), .(vessel, grp)][, grp := NULL][]
# vessel time gap over_thresh split_factor
# 1: A 2017-01-01 00:02:25 0.0000000 TRUE split_1
# 2: A 2017-01-01 01:31:26 1.4836111 TRUE split_1
# 3: A 2017-01-01 02:37:42 1.1044444 TRUE split_1
# 4: A 2017-01-01 03:14:34 0.6144444 TRUE split_1
# 5: A 2017-01-01 04:09:45 0.9197222 TRUE split_1
# 6: A 2017-02-01 05:51:53 745.7022222 FALSE delete
# 7: A 2017-03-01 06:22:24 672.5086111 FALSE split_2
# 8: A 2017-03-01 07:34:44 1.2055556 TRUE split_2
# 9: A 2017-03-01 08:01:15 0.4419444 TRUE split_2
#10: A 2017-03-01 09:16:44 1.2580556 TRUE split_2
#11: A 2017-03-01 10:48:12 1.5244444 TRUE split_2
Or using dplyr
and rle
, after grouping by 'vessel', use the rle
on the 'over_thresh' OR (|
) the lead
(i.e. the next value) of 'over_thresh' which return a list
of lengths
and values
. Now, we make a manipulation of the 'values' (which is logical
) by assigning the ones that are TRUE with 'split_', sequence of 'values', then change the ones that are FALSE
to 'delete'
library(dplyr)
library(stringr)
df %>%
group_by(vessel) %>%
mutate(split_factor = inverse.rle(within.list(rle(over_thresh|
lead(over_thresh)),
values[values] <- str_c('split_', seq_along(values[values])))),
split_factor = replace(split_factor,
!as.logical(split_factor), 'delete'))
# A tibble: 11 x 5
# Groups: vessel [1]
# vessel time gap over_thresh split_factor
# <chr> <dttm> <dbl> <lgl> <chr>
# 1 A 2017-01-01 00:02:25 0 TRUE split_1
# 2 A 2017-01-01 01:31:26 1.48 TRUE split_1
# 3 A 2017-01-01 02:37:42 1.10 TRUE split_1
# 4 A 2017-01-01 03:14:34 0.614 TRUE split_1
# 5 A 2017-01-01 04:09:45 0.920 TRUE split_1
# 6 A 2017-02-01 05:51:53 746. FALSE delete
# 7 A 2017-03-01 06:22:24 673. FALSE split_2
# 8 A 2017-03-01 07:34:44 1.21 TRUE split_2
# 9 A 2017-03-01 08:01:15 0.442 TRUE split_2
#10 A 2017-03-01 09:16:44 1.26 TRUE split_2
#11 A 2017-03-01 10:48:12 1.52 TRUE split_2
Upvotes: 2