Reputation: 65
I have two data.frames with information.
The first one is the biggest, and contains a lot of rows for unique speed measurements in curves. So it has columns with a curveID, a datetime of the measurement, and a lot of columns with other speed and curve information.
The second one is a small database, containing known roadworks. It contains three columns: curveID, from and to. curveID has the same factors as the first database, from is a datetime (POSIXct) is the start roadworks on that roadworks on that section and to is a datetime (POSIXct) containing the end of roadworks.
Quite simplified the databases look like this:
speedmeasurements <- data.frame("curve_id" = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3), "datetime" = c(0, 2, 8, 9, 1, 2, 3, 4, 5, 1, 3, 5))
roadworks <- data.frame("curve_id" = c(1, 1, 3), "from" = c(1, 5, 2), "to" = c(3, 7, 4))
I've added a column to create intervals, like this roadworks$range <- interval(roadworks$from, roadworks$to)
Now I want to add a column roadworks to my first database. This should be a logical value, which checks whether or not a speed measurement was done during roadworks. So, I need a bit of code which checks if the combination of curveID and datetime of the speed measurement are within roadworks timeslots.
In the simpified example, I would like to get this result:
speedmeasurements <- data.frame("curve_id" = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3), "datetime" = c(0, 2, 8, 9, 1, 2, 3, 4, 5, 1, 3, 5), "roadworks" = c(F, T, F, F, F, F, F, F, F, F, T, F))
I've been thinking of an ifelse()
like this:
speedmeasurements$roadworks<- ifelse(speedmeasurements$curve_id == roadworks$curve_id & speedmeasurements$datetime %within% roadworks$range, T, F)
, but this seems to fail due to different object lengths.
Does anyone have a way forward? Perhaps a data.table
solution, but I'm quite novice in that area.
Kind regards,
Johan
Upvotes: 0
Views: 112
Reputation: 160687
Since you want to do with by curve_id
, it suggests a join operation. Unfortunately, base R's merge
and dplyr
's *_join
functions do not do inequality, so you'd be stuck with splitting by curve_id
, doing some manual labor on each, and then rebinding them.
data.table
allows "in-equi" joins (join on a range) directly.
I'll start by adding in the "roadworks" column so that we see something from the join.
library(data.table)
setDT(speedmeasurements)
setDT(roadworks)
roadworks[,roadworks := TRUE]
roadworks
# curve_id from to roadworks
# 1: 1 1 3 TRUE
# 2: 1 5 7 TRUE
# 3: 3 2 4 TRUE
And now the merge:
roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime) ]
# curve_id from to roadworks
# 1: 1 0 0 NA
# 2: 1 2 2 TRUE
# 3: 1 8 8 NA
# 4: 1 9 9 NA
# 5: 2 1 1 NA
# 6: 2 2 2 NA
# 7: 2 3 3 NA
# 8: 2 4 4 NA
# 9: 2 5 5 NA
# 10: 3 1 1 NA
# 11: 3 3 3 TRUE
# 12: 3 5 5 NA
A few things from this:
data.table
's merge premise is that A[B, on=.(...)]
is effectively "B left join A" (see How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272);A
, not B
, so we will want to rename themNA
, which is expected.Full code with cleanup:
speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime)
][, to := NULL ][, roadworks := !is.na(roadworks) ]
setnames(speed2, "from", "datetime")
speed2
# curve_id datetime roadworks
# 1: 1 0 FALSE
# 2: 1 2 TRUE
# 3: 1 8 FALSE
# 4: 1 9 FALSE
# 5: 2 1 FALSE
# 6: 2 2 FALSE
# 7: 2 3 FALSE
# 8: 2 4 FALSE
# 9: 2 5 FALSE
# 10: 3 1 FALSE
# 11: 3 3 TRUE
# 12: 3 5 FALSE
We can reduce the lines of code slightly with
speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime)
][, c("datetime", "from", "to") := .(from, NULL, NULL) ][, roadworks := !is.na(roadworks) ]
For readability, I often use magrittr::%>%
with data.table
(though most often associate it with dplyr
), resulting in this:
library(magrittr)
speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime) ] %>%
.[, c("datetime", "from", "to") := .(from, NULL, NULL) ] %>%
.[, roadworks := !is.na(roadworks) ]
(Note that data.table
has a long-standing bug where if the last operation is an :=
assignment, see https://rdatatable.gitlab.io/data.table/articles/datatable-faq.html#why-do-i-have-to-type-dt-sometimes-twice-after-using-to-print-the-result-to-console)
Upvotes: 1