nsnvc
nsnvc

Reputation: 43

interpolation/lookup in R

I'm switching to R from excel and was wondering how to do this in R.
I have a dataset that looks something like this:

df1<-data.frame(Zipcode=c("7941AH","7941AG","7941AH","7941AZ"),
                From=c(2,30,45,1),
                To=c(20,38,57,8),
                Type=c("even","mixed","odd","mixed"),
                GPS=c(12345,54321,11221,22331)) 

df2<-data.frame(zipcode=c("7914AH", "7914AH", "7914AH", "7914AG","7914AG","7914AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))

First dataset contains zipcode, house number range (from and to), type meaning if the range contains even, odd or mixed house numbers and gps coordinates. Second dataset contains only address (zipcode, house number).

What I want to do is to lookup gps coordinates for df2. For example address with zipcode 7941AG and housenumber 18 (even number between 2 and 20) has gps coordinate 12345.

Update: As it didn't cross my mind that the size of the dataset is important for the chosen solution (I know, bit naive...) here some extra information: Actual size of df1 is 472.000 observations and df2 has 1.1 million observations. The number of unique zipcodes in df1 is 280.000. I stumbled upon this post speed up the loop operation in R with some interesting findings, but I don't know how to incorporate this in the solution provided by @josilber

Upvotes: 3

Views: 218

Answers (2)

josliber
josliber

Reputation: 44340

Given large data frames, your best bet may be to merge df1 and df2 by their zip codes (aka get every pair of rows from the data frames where they have the same zip code), filter by the house number criteria, remove duplicates (cases where multiple rules from df1 match), and then store the information about all the matched houses. Let's start with a sample dataset of the size you indicated:

set.seed(144)
df1 <- data.frame(Zipcode=sample(1:280000, 472000, replace=TRUE),
                  From=sample(1:50, 472000, replace=TRUE),
                  To=sample(51:100, 472000, replace=TRUE),
                  Type=sample(c("even", "odd", "mixed"), 472000, replace=TRUE),
                  GPS=sample(1:100, 472000, replace=TRUE))
df2 <- data.frame(zipcode=sample(1:280000, 1.1e6, replace=TRUE),
                  housenum=sample(1:100, 1.1e6, replace=TRUE))

Now we can perform the efficient computation of the GPS data:

get.gps <- function(df1, df2) {
  # Add ID to df2
  df2$id <- 1:nrow(df2)
  m <- merge(df1, df2, by.x="Zipcode", by.y="zipcode")
  m <- m[m$housenum >= m$From &
         m$housenum <= m$To &
         (m$Type == "mixed" |
          (m$Type == "odd" & m$housenum %% 2 == 1) |
          (m$Type == "even" & m$housenum %% 2 == 0)),]
  m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast=TRUE),]
  GPS <- rep(NA, nrow(df2))
  GPS[m$id] <- m$GPS
  return(GPS)
}
system.time(get.gps(df1, df2))
#    user  system elapsed 
#  16.197   0.561  17.583 

This is a much more acceptable runtime -- 18 seconds instead of the 90 hours that you estimated in the comment of my other answer!

Upvotes: 1

josliber
josliber

Reputation: 44340

I would just loop through each of the elements in df2, implementing the logic needed to check if the zip code matches and the element ranges are right and the even/odd are correct:

# Clean up data (character zip codes and fix the 7914 vs. 7941 issue in zip codes)
df2<-data.frame(zipcode=c("7941AH", "7941AH", "7941AH", "7941AG","7941AG","7941AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))
df1$Zipcode <- as.character(df1$Zipcode)
df2$zipcode <- as.character(df2$zipcode)

# Loop to compute the GPS values
sapply(seq(nrow(df2)), function(x) {
  m <- df2[x,]
  matched <- df1$Zipcode == m$zipcode &
    m$housenum >= df1$From &
    m$housenum <= df1$To &
    (df1$Type == "mixed" |
     (df1$Type == "odd" & m$housenum %% 2 == 1) |
     (df1$Type == "even" & m$housenum %% 2 == 0))
  if (sum(matched) != 1) {
    return(NA)  # No matches or multiple matches
  } else {
    return(df1$GPS[matched])
  }
})
# [1] 12345    NA    NA 54321    NA    NA

By inspection, only the first and fourth elements of df2 are matched by one of the rules in df1.

Upvotes: 1

Related Questions