Reputation: 119
I have 2 datasets one for hospitals and another for procedures. Each dataset has latitude and longitude coordinates. Procedures are either given in or out of the hospital, though the coordinates are not necessarily precise if given in the hospitals. Im trying to form a radius of a certain size around each of the hospitals and determine how many procedure points fall within that radius on average. So if for example I have 100 hospitals and 3000 procedures, I want to form a radius around all hospitals and see on average how many hospitals fall into that specified radius. My initial code is below, but I know this can be done faster. coded in R. Thanks!
for(i in 1:NROW(hospitals)){
hospital <- hospitals[i,]
radius <- .016
# find all the procedures that lie in the .016 sized radius from this hospital
hospital$latitude_low <- hospital$lat - radius
hospital$longitude_low <- hospital$long - radius
hospital$latitude_high <- hospital$lat + radius
hospital$longitude_high <- hospital$long + radius
in_rad <- procedures[(procedures$long >= hospital$longitude_low & procedures$long <=
hospital$longitude_high & procedures$lat <= hospital$latitude_high & procedures$lat >=
hospital$latitude_low),]
num <- NROW(in_rad)
hospitals[i,]$number_of_procedures <- num
}
Upvotes: 2
Views: 1818
Reputation: 47091
When you ask a question, you should always include some example data. Like this
lat <- c(-23.8, -25.8)
lon <- c(-49.6, -44.6)
hosp <- cbind(lon, lat)
lat <- c(-22.8, -24.8, -29.1, -28, -20)
lon <- c(-46.4, -46.3, -45.3, -40, -30)
procedures <- cbind(lon, lat)
Are your data in longitude/latitude? If so, you need to use a proper method to compute distances. For example
library(geosphere)
dm <- distm(procedures, hosp)
Or
library(raster)
d <- pointDistance(procedures, hosp, lonlat=TRUE)
Both compute the distance from all procedures to all hospitals. This will fail with very large datasets, but from what you describe it should work fine. Now you can use a threshold (here 400,000 m) to find out which procedures are within that distance of each hospital
apply(d < 400000, 2, which)
#[[1]]
#[1] 1 2
#[[2]]
#[1] 1 2 3
So procedure 1, 2 and 3 are within that distance to hospital 2
If your data are not longitude/latitude, you can use
d <- pointDistance(procedures, hosp, lonlat=FALSE)
Upvotes: 3
Reputation: 173793
There are a couple of things that could be improved here. Firstly, you are not actually calculating procedures done within a radius of 0.16 units from the hospital, but procedures done within a 0.32 * 0.32 units square with the hospital at its center. Probably not a huge deal for the specific problem, but its actually quicker to work out points within a particular distance, as you actually intended.
Secondly, you have have a tendency to store any variables you have calculated even if you're only going to use them once. This can help with understanding the code, but is sometimes less efficient and certainly makes your code longer, particularly if you like using long_descriptive_variable_names
.
Thirdly, at the end, you subset procedures
and then measure the number of rows, rather than just using the length of the subset itself.
Lastly (but less importantly), you write the result one value at a time into a new column. You can do this all in one gulp using sapply
instead.
So your code could be replaced with something much simpler, like:
hospitals$number_of_procedures <- sapply(1:NROW(hospitals), function(i)
{
d <- (procedures$long - hospitals[i,]$long)^2 + (procedures$lat - hospitals[i,]$lat)^2
length(which(d < 0.16^2))
})
Upvotes: 1