Johan Vos
Johan Vos

Reputation: 65

In R test if case is within a range of times and a unique ID

I have two data.frames with information.

The first one is the biggest, and contains a lot of rows for unique speed measurements in curves. So it has columns with a curveID, a datetime of the measurement, and a lot of columns with other speed and curve information.

The second one is a small database, containing known roadworks. It contains three columns: curveID, from and to. curveID has the same factors as the first database, from is a datetime (POSIXct) is the start roadworks on that roadworks on that section and to is a datetime (POSIXct) containing the end of roadworks.

Quite simplified the databases look like this:

speedmeasurements <- data.frame("curve_id" = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3), "datetime" = c(0, 2, 8, 9, 1, 2, 3, 4, 5, 1, 3, 5))
roadworks <- data.frame("curve_id" = c(1, 1, 3), "from" = c(1, 5, 2), "to" = c(3, 7, 4))

I've added a column to create intervals, like this roadworks$range <- interval(roadworks$from, roadworks$to)

Now I want to add a column roadworks to my first database. This should be a logical value, which checks whether or not a speed measurement was done during roadworks. So, I need a bit of code which checks if the combination of curveID and datetime of the speed measurement are within roadworks timeslots.

In the simpified example, I would like to get this result:

speedmeasurements <- data.frame("curve_id" = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3), "datetime" = c(0, 2, 8, 9, 1, 2, 3, 4, 5, 1, 3, 5), "roadworks" = c(F, T, F, F, F, F, F, F, F, F, T, F))

I've been thinking of an ifelse() like this: speedmeasurements$roadworks<- ifelse(speedmeasurements$curve_id == roadworks$curve_id & speedmeasurements$datetime %within% roadworks$range, T, F), but this seems to fail due to different object lengths.

Does anyone have a way forward? Perhaps a data.table solution, but I'm quite novice in that area.

Kind regards,

Johan

Upvotes: 0

Views: 112

Answers (1)

r2evans
r2evans

Reputation: 160687

Since you want to do with by curve_id, it suggests a join operation. Unfortunately, base R's merge and dplyr's *_join functions do not do inequality, so you'd be stuck with splitting by curve_id, doing some manual labor on each, and then rebinding them.

data.table allows "in-equi" joins (join on a range) directly.

I'll start by adding in the "roadworks" column so that we see something from the join.

library(data.table)
setDT(speedmeasurements)
setDT(roadworks)

roadworks[,roadworks := TRUE]
roadworks
#    curve_id from to roadworks
# 1:        1    1  3      TRUE
# 2:        1    5  7      TRUE
# 3:        3    2  4      TRUE

And now the merge:

roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime) ]
#     curve_id from to roadworks
#  1:        1    0  0        NA
#  2:        1    2  2      TRUE
#  3:        1    8  8        NA
#  4:        1    9  9        NA
#  5:        2    1  1        NA
#  6:        2    2  2        NA
#  7:        2    3  3        NA
#  8:        2    4  4        NA
#  9:        2    5  5        NA
# 10:        3    1  1        NA
# 11:        3    3  3      TRUE
# 12:        3    5  5        NA

A few things from this:

Full code with cleanup:

speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime)
                    ][, to := NULL ][, roadworks := !is.na(roadworks) ]
setnames(speed2, "from", "datetime")
speed2
#     curve_id datetime roadworks
#  1:        1        0     FALSE
#  2:        1        2      TRUE
#  3:        1        8     FALSE
#  4:        1        9     FALSE
#  5:        2        1     FALSE
#  6:        2        2     FALSE
#  7:        2        3     FALSE
#  8:        2        4     FALSE
#  9:        2        5     FALSE
# 10:        3        1     FALSE
# 11:        3        3      TRUE
# 12:        3        5     FALSE

We can reduce the lines of code slightly with

speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime)
                    ][, c("datetime", "from", "to") := .(from, NULL, NULL) ][, roadworks := !is.na(roadworks) ]

For readability, I often use magrittr::%>% with data.table (though most often associate it with dplyr), resulting in this:

library(magrittr)
speed2 <- roadworks[ speedmeasurements, on = .(curve_id, from <= datetime, to >= datetime) ] %>%
  .[, c("datetime", "from", "to") := .(from, NULL, NULL) ] %>%
  .[, roadworks := !is.na(roadworks) ]

(Note that data.table has a long-standing bug where if the last operation is an := assignment, see https://rdatatable.gitlab.io/data.table/articles/datatable-faq.html#why-do-i-have-to-type-dt-sometimes-twice-after-using-to-print-the-result-to-console)

Upvotes: 1

Related Questions