Reputation: 89
I have a large data frame (about 1 million rows) with longitude and latitude information, and I would like to get the country and state/province for each point. However, my code is not as efficient as I expected.
Below is my code:
Sample data frame:
df = data.frame(
  ID = c("A00001", "A00002", "A00003", "A00004", "A00005"),
  longitude = c(-98.84295, -91.11844, -75.91037, -71.00733, -92.29651),
  latitude = c(43.98332, 40.17851, 39.26118, 46.70087, 45.49510)
)
First: read geoinformation
library(sp)
library(rgdal)
library(dplyr)
countries_map<- readOGR(dsn="Country", layer="ne_10m_admin_0_countries")
states_map <- readOGR(dsn="States", layer="ne_10m_admin_1_states_provinces")
Then, build a function and export the result to the designated data frame
geo_to_location <- function(lat, long) {
  # Convert the coordinates to SpatialPoints (x = longitude, y = latitude)
  points <- SpatialPoints(data.frame(long, lat))
  # Assign the countries map's CRS to the points (no reprojection happens here)
  proj4string(points) <- proj4string(countries_map)
  country <- as.character(over(points, countries_map)$NAME)
  # The same for state/province
  proj4string(points) <- proj4string(states_map)
  state <- as.character(over(points, states_map)$name)
  # Return a one-row tibble with the two names
  dplyr::bind_rows(setNames(c(country, state), c("Country", "State")))
}
df = df %>% dplyr::bind_cols(purrr::map2_dfr(.$latitude, .$longitude, geo_to_location ))
This method works, but 400,000 points already take about 30 minutes to complete, and I have more than 400k points to process. Is there a more efficient way to do this, or is this already as fast as it gets?
Thank you all in advance.
Upvotes: 0
Views: 799
Reputation: 11
I was trying to figure out the same thing. I had a huge database with lat and lon (and geolocation) but no location names. I needed country, state (US) and county (US). The solution was shockingly simple: use the map.where() function from the maps package. It worked for me. For example, for the country it is just:
map.where(database = "world", df$lon, df$lat)
For US states or counties, replace "world" with "state" or "county". Here df$lon and df$lat are the longitude and latitude columns.
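As a rough sketch of how this could look on the sample data from the question (which names its columns longitude and latitude rather than lon and lat), assuming the maps package is installed. The Country and State column names and the colon-stripping step are illustrative choices, not part of the original answer. Points that fall outside every polygon of a database come back as NA, so non-US points get NA from the "state" database.

```r
library(maps)

# Sample data from the question
df <- data.frame(
  ID = c("A00001", "A00002", "A00003", "A00004", "A00005"),
  longitude = c(-98.84295, -91.11844, -75.91037, -71.00733, -92.29651),
  latitude = c(43.98332, 40.17851, 39.26118, 46.70087, 45.49510)
)

# x is longitude, y is latitude; each call handles all rows at once
df$Country <- map.where(database = "world", x = df$longitude, y = df$latitude)
df$State <- map.where(database = "state", x = df$longitude, y = df$latitude)

# Some names come back as "region:subregion"; keep only the part before the colon
df$Country <- sub(":.*$", "", df$Country)
df$State <- sub(":.*$", "", df$State)

print(df)
```

Note that the "state" database returns lowercase names, so some post-processing may be needed if they have to match another data source.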
Upvotes: 1
Reputation: 89
Thanks to @starja, who suggested vectorizing the function and using data.table instead of dplyr.
I tested with the first 500 rows and got a huge difference in run time.
Below is the modified code:
geo_to_location <- function(lat, long) {
  # Convert the coordinates to SpatialPoints (x = longitude, y = latitude)
  points <- SpatialPoints(data.frame(long, lat))
  # Assign the countries map's CRS to the points (no reprojection happens here)
  proj4string(points) <- proj4string(countries_map)
  country <- as.character(over(points, countries_map)$NAME)
  # The same for state
  proj4string(points) <- proj4string(states_map)
  state <- as.character(over(points, states_map)$name)
  # Return two vectors, one element per input point
  return(list(country = country, state = state))
}
library(data.table)
df = as.data.table(df)
# The function is called once with the whole latitude and longitude columns
df[, c("Country","State_Province") := geo_to_location(latitude, longitude)]
df = as.data.frame(df)
The original method took about 3.194 minutes to process 500 points. The new method took about 0.651 seconds. If there is an even more efficient way to handle this, please let me know so that I can learn it.
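The speedup most likely comes from calling over() once on all points instead of once per point, rather than from data.table itself. Below is a minimal, self-contained sketch of that difference, assuming only the sp package. The two unit squares, the regions object, and the lookup() helper are hypothetical stand-ins for the real shapefiles and function, so no rgdal or downloaded data is needed.

```r
library(sp)

# Hypothetical stand-in for a country layer: two adjacent unit squares
square <- function(x0, id) {
  # Ring listed clockwise, explicitly marked as not a hole
  ring <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0), c(0, 1, 1, 0, 0))
  Polygons(list(Polygon(ring, hole = FALSE)), ID = id)
}
regions <- SpatialPolygonsDataFrame(
  SpatialPolygons(list(square(0, "a"), square(1, "b"))),
  data.frame(NAME = c("Left", "Right"), row.names = c("a", "b"))
)

# Random points that all fall inside one of the two squares
set.seed(1)
n <- 200
long <- runif(n, 0, 2)
lat <- runif(n, 0, 1)

# Same shape as geo_to_location(), reduced to a single layer
lookup <- function(lat, long) {
  pts <- SpatialPoints(data.frame(long, lat))
  as.character(over(pts, regions)$NAME)
}

# Row-by-row: builds a SpatialPoints object and runs over() n times
t_loop <- system.time(
  one_by_one <- vapply(seq_len(n), function(i) lookup(lat[i], long[i]), character(1))
)

# Vectorized: one SpatialPoints object, one over() call
t_vec <- system.time(all_at_once <- lookup(lat, long))

# Both approaches give the same answer; only the run time differs
print(identical(one_by_one, all_at_once))
print(rbind(loop = t_loop, vectorized = t_vec))
```

Because the function already accepts whole vectors, the same single call should also work in base R or dplyr, for example by assigning the two returned vectors directly to new columns.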
Again, thank you for the suggestion and help.
Upvotes: 1