Reputation: 2995
I have some data that looks like this:
ID     lat     long     university  date        cat2  cat3  cat4     ...
00001  32.001  -64.001  MIT         2011-07-01  xyz   foo   NA       ...
00002  45.783   67.672  Harvard     2011-07-01  abc   NA    lion     ...
00003  54.823   78.762  Stanford    2011-07-01  xyz   bar   NA       ...
00004  76.782   23.989  IIT Bombay  2011-07-02  NA    foo   NA       ...
00005  32.010  -64.010  NA          2011-07-02  NA    NA    hamster  ...
00006  32.020  -64.020  NA          2011-07-03  NA    NA    NA       ...
00006  45.793   67.700  NA          2011-08-01  NA    bar   badger   ...
I want to impute missing values in the university column based on the lat-long coordinates. (The sample above is obviously made up; the real data is 500K rows and rather sparse in the university column.) Imputation packages like Amelia seem to want to fit numerical data to a linear model, and zoo seems to fill in missing values based on some kind of ordered series, which I don't have. I want to match nearby lat-longs, not just exact lat-long pairs, so I can't simply fill in one column by matching values from another.
My plan is to find all the lat-long pairs associated with each university, draw a bounding box around them, and then, for every row that has a lat-long pair but no university, fill in the university whose box the point falls in (or perhaps whose known locations' midpoint it falls within a certain radius of). Something like the sketch below.
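Here's a rough sketch of the bounding-box idea (myData is hypothetical, with the columns from the sample above; what to do when boxes overlap is left open):
## sketch of the bounding-box approach; myData is assumed to hold
## the sample columns above (lat, long, university)
known   <- myData[!is.na(myData$university), ]
unknown <- myData[ is.na(myData$university), ]

## one bounding box per university, from its known locations
boxes <- do.call(rbind, lapply(split(known, known$university), function(d) {
  data.frame(university = d$university[1],
             minLat = min(d$lat), maxLat = max(d$lat),
             minLon = min(d$long), maxLon = max(d$long))
}))

## fill in a university only when exactly one box contains the point
for (i in seq_len(nrow(unknown))) {
  hit <- which(unknown$lat[i]  >= boxes$minLat & unknown$lat[i]  <= boxes$maxLat &
               unknown$long[i] >= boxes$minLon & unknown$long[i] <= boxes$maxLon)
  if (length(hit) == 1) unknown$university[i] <- boxes$university[hit]
}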
Has anyone ever done something similar? Are there any packages that make it easier to group geographically proximate lat-long pairs or maybe even to do geographically-based imputation?
If that works, I'd like to take a crack at imputing some of the other missing values based on existing values in the data (e.g., if 90% of rows with xyz, foo, and Harvard also have lion in cat4, we can impute some of the missing cat4 values; something like the sketch below). But that's really a separate question, and I imagine a much harder one that I might not even have enough data to do successfully.
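For that second step I picture filling each missing cat4 with the most common cat4 among rows that share the same cat2/cat3/university combination (again just a sketch against the hypothetical myData; fillMode is a made-up helper):
## fill missing values with the modal value within each group
fillMode <- function(x) {
  if (all(is.na(x))) return(x)
  tab <- table(x[!is.na(x)])
  x[is.na(x)] <- names(tab)[which.max(tab)]
  x
}
## group rows by their cat2 / cat3 / university combination
grp <- paste(myData$cat2, myData$cat3, myData$university)
myData$cat4 <- ave(myData$cat4, grp, FUN = fillMode)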
Upvotes: 2
Views: 852
Reputation: 60746
I don't have a package in mind for what you're describing. I've done some similar analyses and ended up writing something bespoke.
Just to give you a jumping-off point, here's an example of one way of doing a nearest-neighbor calculation. It's kind of slow because, obviously, you have to compare every point against every other point.
## make some pretend data
n <- 1e4
lat <- rnorm(n)
lon <- rnorm(n)
index <- 1:n
myDf <- data.frame(lat, lon, index)
## create a few helper functions
## Euclidean distance from (x1, y1) to each (x2, y2) pair
cartDist <- function(x1, y1, x2, y2){
  ( (x2 - x1)^2 + (y2 - y1)^2 )^.5
}
## indices and distances of the n nearest points
nearestNeighbors <- function(x1, y1, x2, y2, n=1){
  dists <- cartDist(x1, y1, x2, y2)
  index <- order(dists)[1:n]
  neighborValues <- dists[index]
  return(list(index, neighborValues))
}
## this could be done in an apply statement
## but it's fugly enough as a loop
myDf$nearestNeighbor <- NA
system.time({
  for (i in 1:nrow(myDf)){
    ## drop row i, find its nearest neighbor among the rest,
    ## then map the subset position back to the original index
    others <- myDf[-i, ]
    nn <- nearestNeighbors(myDf$lon[i], myDf$lat[i], others$lon, others$lat)[[1]]
    myDf$nearestNeighbor[i] <- others$index[nn]
  }
})
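One caveat: plain Cartesian distance is fine for the pretend data above, but with real lat/longs you'd want great-circle distance. Here's a haversine version you could swap in for cartDist (a sketch; Earth radius in km):
## great-circle (haversine) distance in km; same argument order as
## cartDist above (x = longitude, y = latitude), so it swaps in directly
haversineDist <- function(x1, y1, x2, y2, r = 6371) {
  toRad <- pi / 180
  dLat <- (y2 - y1) * toRad
  dLon <- (x2 - x1) * toRad
  a <- sin(dLat / 2)^2 +
       cos(y1 * toRad) * cos(y2 * toRad) * sin(dLon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  ## pmin guards against rounding past 1
}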
Upvotes: 2