R: Add index column based on values from other dataframe

Question

I am trying to give coordinate values a label.

I have one data frame with coordinates, which looks similar to

   lat               lng
    1                 10
    3                 11
    2                 12

I have another data of grids and their top, bottom, left and right coordinates.

 id     left    right   bottom      top
 a      0.5      1.5      9          11      
 b      1.5      2.5      9          11  
 c      2.5      3.5     10          11

id columns are the id of each rectangular grid.

I would like to merge the id to the first dataframe, based on the range of the longitude and latitude of the grid.

So the first row of the expected output would be something like

   lat               lng       id
    1                 10        a

How would I achieve this without specifying the range one by one. I was thinking of some kind of subset function, but I'm not sure how to code this.

I would appreciate some help.

r2evans · Accepted Answer

FYI, "latitude" is generally North/South, which to me typically means "top" and "bottom" would make more sense. The code matches your frames, but it seems a little different.

This is a range-based join or a "non-equi" join. This is not natively supported in dplyr, but it can be done in R with at least three other packages:

fuzzyjoin

fuzzyjoin::fuzzy_left_join(
  df1, df2,
  by = c("lat" = "left", "lat" = "right", "lng" = "bottom", "lng" = "top"),
  match_fun = list(`>=`, `<=`, `>=`, `<=`)
)[, c("lat", "lng", "id")]
#   lat lng   id
# 1   1  10    a
# 2   3  11    c
# 3   2  12

(There's also fuzzyjoin::fuzzy_inner_join if you prefer to drop the bottom row here.)

The "verb" functions in fuzzyjoin are intentionally (I believe) named to coincide with the dplyr join functions, so this can be done inline in a dplyr pipe:

library(dplyr)
df1 %>%
  fuzzyjoin::fuzzy_left_join(
    ., df2,
    by = c("lat" = "left", "lat" = "right", "lng" = "bottom", "lng" = "top"),
    match_fun = list(`>=`, `<=`, `>=`, `<=`)
  ) %>%
  select(lat, lng, id)

data.table

library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[DT1, on = .(left <= lat, right >= lat, bottom <= lng, top >= lng)][, .(lat=left, lng=top, id)]
#      lat   lng     id
#      
# 1:     1    10      a
# 2:     3    11      c
# 3:     2    12

I should note that the column names post-merge are not necessarily what one might expect: for instance, you won't see lat or lng, because the output column names are taken from the "right" of the merge, i.e. DT2. The merge syntax of DT2[DT1,...] is analogous to left_join(DT1, DT2, ...), so the names are opposite the side one might expect. There is no data.table::[-syntax for inner or full merges, nor for taking the names from the other argument. (In my opinion, a left-merge should use the names from the "left" frame, but there is likely good reason it is what it is.)

sqldf

SQL supports it natively, so if you're using a DBMS somewhere, the queries and join can be done there. Lacking that, the sqldf package gives SQL access to local R objects (using the SQLite engine).

sqldf::sqldf(
  "select df1.lat, df1.lng, df2.id
   from df1
     left join df2 on df1.lat between df2.left and df2.right
                  and df1.lng between df2.bottom and df2.top"
)
#   lat lng   id
# 1   1  10    a
# 2   3  11    c
# 3   2  12

Data

df1 <- structure(list(lat = c(1L, 3L, 2L), lng = 10:12), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(id = c("a", "b", "c"), left = c(0.5, 1.5, 2.5), right = c(1.5, 2.5, 3.5), bottom = c(9L, 9L, 10L), top = c(11L, 11L, 11L)), class = "data.frame", row.names = c(NA, -3L))

R: Add index column based on values from other dataframe

Answers (1)

fuzzyjoin

data.table

sqldf

Related Questions