William Gunn

Reputation: 2995

How can I fill in missing categorical values based on geographical proximity in R?

I have some data that looks like this:

ID      lat      long     university   date        cat2    cat3   cat4   ...
00001   32.001   -64.001  MIT          2011-07-01  xyz     foo    NA     ...
00002   45.783   67.672   Harvard      2011-07-01  abc     NA     lion   ...
00003   54.823   78.762   Stanford     2011-07-01  xyz     bar    NA     ...
00004   76.782   23.989   IIT Bombay   2011-07-02  NA      foo    NA     ...
00005   32.010   -64.010  NA           2011-07-02  NA      NA     hamster ...
00006   32.020   -64.020  NA           2011-07-03  NA      NA     NA     ...
00006   45.793   67.700   NA           2011-08-01  NA      bar    badger ...

I want to impute missing values for the university column based on the lat-long coordinates. The sample above is obviously made up; the real data has about 500K rows and is rather sparse in the university column. Imputation packages like Amelia seem to want to fit numerical data according to a linear model, and zoo seems to want to fill in missing values based on some sort of ordered series, which I don't have. I want to match close lat-longs, not just exact lat-long pairs, so I can't just fill in one column by matching values from another.

I plan to approach the problem by finding all the lat-long pairs associated with a university and drawing a bounding box around them. Then, for every row that has a lat-long pair but no university, I would fill in the university whose box contains the point, or perhaps the university whose known locations have a midpoint within a certain radius of the point.
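To make the radius variant of that plan concrete, here is a minimal base-R sketch. The helper names (haversineKm, imputeByCentroid), the 10 km cutoff, and the toy rows are assumptions for illustration, not part of any package. Averaging lat and long to get a midpoint is only reasonable for tight clusters that do not straddle the antimeridian.

```r
## Toy data modelled on the sample rows above (all values are made up)
df <- data.frame(
  lat  = c(32.001, 45.783, 54.823, 32.010, 32.020, 45.793, 10.000),
  long = c(-64.001, 67.672, 78.762, -64.010, -64.020, 67.700, 10.000),
  university = c("MIT", "Harvard", "Stanford", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

## Great-circle (haversine) distance in km; inputs are decimal degrees
haversineKm <- function(lat1, lon1, lat2, lon2, r = 6371) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

## Midpoint of the known locations for each university
known <- df[!is.na(df$university), ]
centroids <- aggregate(cbind(lat, long) ~ university, data = known, FUN = mean)

## Give each unlabelled row the closest midpoint's label, but only when
## that midpoint lies within maxKm (an assumed, tunable cutoff)
imputeByCentroid <- function(df, centroids, maxKm = 10) {
  for (i in which(is.na(df$university))) {
    d <- haversineKm(df$lat[i], df$long[i], centroids$lat, centroids$long)
    j <- which.min(d)
    if (d[j] <= maxKm) {
      df$university[i] <- as.character(centroids$university[j])
    }
  }
  df
}

filled <- imputeByCentroid(df, centroids, maxKm = 10)
```

Rows that are not within the cutoff of any midpoint stay NA, which seems safer than forcing a label onto every row.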

Has anyone ever done something similar? Are there any packages that make it easier to group geographically proximate lat-long pairs or maybe even to do geographically-based imputation?

If that works, I'd like to take a crack at imputing some of the other missing values based on existing values in the data. For example, if 90% of rows with xyz, foo, Harvard values also have lion in the 4th category, we can impute some missing values for cat4. That's another question, though, and I would imagine a much harder one, which I might not even have enough data to do successfully.
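As a rough illustration of that last idea, the sketch below fills a missing value with the dominant value among rows that share the same values in other columns, but only when that value reaches a minimum share. The function name imputeByGroupMode, the minShare threshold, and the toy rows are assumptions; rows with NA in any grouping column are left alone.

```r
## Made-up rows in the spirit of the sample data
df <- data.frame(
  cat2 = c("abc", "abc", "abc", "abc", "xyz", "xyz"),
  cat3 = c("foo", "foo", "foo", "foo", "bar", "bar"),
  cat4 = c("lion", "lion", "lion", NA, "badger", NA),
  stringsAsFactors = FALSE
)

## Fill NA in `target` with the most common value among rows that share
## the same `by` values, provided its share is at least minShare
imputeByGroupMode <- function(df, target, by, minShare = 0.9) {
  key <- do.call(paste, c(df[by], sep = "|"))
  key[!complete.cases(df[by])] <- NA       # do not group on missing keys
  missing <- is.na(df[[target]])
  for (k in unique(key[missing & !is.na(key)])) {
    inGroup <- !is.na(key) & key == k
    seen <- df[[target]][inGroup & !missing]
    if (length(seen) == 0) next            # nothing to learn from
    counts <- table(seen)
    if (max(counts) / length(seen) >= minShare) {
      df[[target]][inGroup & missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

filled <- imputeByGroupMode(df, target = "cat4", by = c("cat2", "cat3"))
```

With sparse data a group may contain only one or two observed values, so a minimum group size would probably be worth adding alongside the share threshold.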

Upvotes: 2

Views: 852

Answers (1)

JD Long

Reputation: 60746

I don't have a package in mind that solves what you're describing. I've done a similar type of analysis and ended up writing something bespoke.

Just to give you a jumping-off point, here's an example of one way of doing a nearest neighbor calculation. Calculating neighbors this way is kind of slow because you have to compare every point against every other point, so the work grows with the square of the number of rows.

## make some pretend data
n <- 1e4
lat <- rnorm(n)
lon <- rnorm(n)
index <- 1:n
myDf <- data.frame(lat, lon, index)

## create a few helper functions
cartDist <- function(x1, y1, x2, y2){
  ## Euclidean distance: the squared differences are summed, not subtracted
  ( (x2 - x1)^2 + (y2 - y1)^2 )^.5
}

nearestNeighbors <- function(x1, y1, x2, y2, n=1){
  dists <- cartDist(x1, y1, x2, y2)
  ## order() gives positions sorted by distance; keep the n closest
  index <- order(dists)[seq_len(n)]
  neighborValues <- dists[index]
  return(list(index, neighborValues))
}


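## this could be done in an apply statement
## but it's fugly enough as a loop
myDf$nearestNeighbor <- NA_integer_  # create the result column up front
system.time({
for (i in 1:nrow(myDf)){
  others <- myDf[-i, ]  # every point except the current one
  nn <- nearestNeighbors( myDf$lon[i], myDf$lat[i], others$lon, others$lat )[[1]]
  ## nn is a position within `others`, so look up the index there
  myDf$nearestNeighbor[i] <- others$index[nn]
}
})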
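A rough sketch of how the same nearest-neighbor idea might be pointed at the question's university column: label each unlabelled point by majority vote among its k closest labelled points. The names knnLabel, known and query, the choice k = 3, and the toy coordinates are assumptions for illustration. Planar distance on raw lat/lon is only an approximation and is least reliable far from the equator or over long ranges.

```r
## Labelled reference points and unlabelled query points (made-up values)
known <- data.frame(
  lat  = c(32.001, 32.003, 45.783, 45.785, 54.823),
  lon  = c(-64.001, -64.004, 67.672, 67.675, 78.762),
  university = c("MIT", "MIT", "Harvard", "Harvard", "Stanford"),
  stringsAsFactors = FALSE
)
query <- data.frame(lat = c(32.010, 45.793), lon = c(-64.010, 67.700))

## Planar (Euclidean) distance: squared differences are summed
cartDist <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)

## Label one point by majority vote among its k nearest labelled points
knnLabel <- function(lat, lon, known, k = 3) {
  d <- cartDist(lon, lat, known$lon, known$lat)
  nearest <- order(d)[seq_len(min(k, nrow(known)))]
  votes <- table(known$university[nearest])
  names(votes)[which.max(votes)]   # ties go to the alphabetically first label
}

query$university <- mapply(knnLabel, query$lat, query$lon,
                           MoreArgs = list(known = known, k = 3))
```

Only the labelled rows need to be searched for each unlabelled row, so with a sparse university column this is far less work than comparing all 500K rows against each other.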

Upvotes: 2
