interpolation/lookup in R

Question

I'm switching to R from excel and was wondering how to do this in R.
I have a dataset that looks something like this:

df1<-data.frame(Zipcode=c("7941AH","7941AG","7941AH","7941AZ"),
                From=c(2,30,45,1),
                To=c(20,38,57,8),
                Type=c("even","mixed","odd","mixed"),
                GPS=c(12345,54321,11221,22331)) 

df2<-data.frame(zipcode=c("7914AH", "7914AH", "7914AH", "7914AG","7914AG","7914AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))

First dataset contains zipcode, house number range (from and to), type meaning if the range contains even, odd or mixed house numbers and gps coordinates. Second dataset contains only address (zipcode, house number).

What I want to do is to lookup gps coordinates for df2. For example address with zipcode 7941AG and housenumber 18 (even number between 2 and 20) has gps coordinate 12345.

Update: As it didn't cross my mind that the size of the dataset is important for the chosen solution (I know, bit naive...) here some extra information: Actual size of df1 is 472.000 observations and df2 has 1.1 million observations. The number of unique zipcodes in df1 is 280.000. I stumbled upon this post speed up the loop operation in R with some interesting findings, but I don't know how to incorporate this in the solution provided by @josilber

josliber · Accepted Answer

Given large data frames, your best bet may be to merge df1 and df2 by their zip codes (aka get every pair of rows from the data frames where they have the same zip code), filter by the house number criteria, remove duplicates (cases where multiple rules from df1 match), and then store the information about all the matched houses. Let's start with a sample dataset of the size you indicated:

set.seed(144)
df1 <- data.frame(Zipcode=sample(1:280000, 472000, replace=TRUE),
                  From=sample(1:50, 472000, replace=TRUE),
                  To=sample(51:100, 472000, replace=TRUE),
                  Type=sample(c("even", "odd", "mixed"), 472000, replace=TRUE),
                  GPS=sample(1:100, 472000, replace=TRUE))
df2 <- data.frame(zipcode=sample(1:280000, 1.1e6, replace=TRUE),
                  housenum=sample(1:100, 1.1e6, replace=TRUE))

Now we can perform the efficient computation of the GPS data:

get.gps <- function(df1, df2) {
  # Add ID to df2
  df2$id <- 1:nrow(df2)
  m <- merge(df1, df2, by.x="Zipcode", by.y="zipcode")
  m <- m[m$housenum >= m$From &
         m$housenum <= m$To &
         (m$Type == "mixed" |
          (m$Type == "odd" & m$housenum %% 2 == 1) |
          (m$Type == "even" & m$housenum %% 2 == 0)),]
  m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast=TRUE),]
  GPS <- rep(NA, nrow(df2))
  GPS[m$id] <- m$GPS
  return(GPS)
}
system.time(get.gps(df1, df2))
#    user  system elapsed 
#  16.197   0.561  17.583

This is a much more acceptable runtime -- 18 seconds instead of the 90 hours that you estimated in the comment of my other answer!

interpolation/lookup in R

Answers (2)

Related Questions