Reputation: 1435
I have a very large excel file of 1000+ street intersections that I need to find the Longitude and latitudes for and then write that info to file/list for a different program to consume.
What I'm stuck on is on how to build a more efficient script using multithreading/multiprocessing, I have looked through other questions/post but I'm i find it all a bit confusing. The code below takes roughly ~ 10+ mins. Any help would be great.
from geopy.geocoders import ArcGIS
import pandas
from datetime import datetime
start_time = datetime.now()
def my_LatLong(address):
n = ArcGIS().geocode(address)
if n != None:
return n
else:
return [address,"None"]
df = pandas.read_excel("street_sample.xlsx",sheet_name=0)
count = 0
street_list = df.loc[:,"Description"]
new_list =[]
for i in list(street_list):
location = my_LatLong(f"{i.split('-')[0]}, Vancouver, Canada")
if location != None:
new_list.append([f"{len(street_list)}, {list(location)}"])
print(f"{count}/{len(street_list)} - {i} = Completed\t\t\t\t" ,end='\r')
else:
print(f"{count}/{len(street_list)} - {i} == None value \t\t\t",end='\r')
count += 1
# doing something with new_list
endtime_time = datetime.now()
print (f"Program ran for: {endtime_time -start_time}")
street_sample.xlsx
ID | Description |
---|---|
12501x | 1900 W Georgia - ONAT_STREET: |
12501x | 4th/6th Diversion & 6th Ped - ONAT_STREET: |
12501x | 4th/6th Diversion & 6th Semi - ONAT_STREET: |
12501x | Abbott & Cordova - ONAT_STREET: |
12501x | Abbott & Expo - ONAT_STREET: |
12501x | Abbott & Hastings - ONAT_STREET: |
12501x | Abbott & Keefer - ONAT_STREET: |
12501x | Aberdeen & Kingsway - ONAT_STREET: |
12501x | Alberta & 49th - ONAT_STREET: 70175 |
12501x | Alder & 12th - ONAT_STREET: |
12501x | Alder & 6th - ONAT_STREET: |
12501x | Alder & Broadway - ONAT_STREET: |
12501x | Alexandra & King Edward - ONAT_STREET: |
12501x | Alma & 10th - ONAT_STREET: |
12501x | Alma & 4th - ONAT_STREET: |
12501x | Alma & 6th - ONAT_STREET: |
12501x | Alma & Broadway - ONAT_STREET: |
12501x | Alma & Point Grey Road - ONAT_STREET: |
12501x | Anderson & 2nd / Lamey's Mill - ONAT_STREET: |
12501x | Anderson (Granville) & 4th - ONAT_STREET: |
12501x | Angus & 41st - ONAT_STREET: |
12501x | Angus & Marine - ONAT_STREET: |
12501x | Arbutus & 10th - ONAT_STREET: |
12501x | Arbutus & 11th - ONAT_STREET: |
12501x | Arbutus & 12th - ONAT_STREET: |
12501x | Arbutus & 16th - ONAT_STREET: |
12501x | Arbutus & 20th - ONAT_STREET: |
12501x | Arbutus & 33rd - ONAT_STREET: |
12501x | Arbutus & 4th - ONAT_STREET: |
12501x | Arbutus & 8th - ONAT_STREET: |
12501x | Arbutus & Broadway - ONAT_STREET: |
12501x | Arbutus & Cornwall - ONAT_STREET: |
12501x | Arbutus & King Edward - ONAT_STREET: |
12501x | Arbutus & Lahb - ONAT_STREET: |
Upvotes: 0
Views: 136
Reputation: 50278
The problem does not comes from Pandas but ArcGIS().geocode(address)
which is insanely slow. Indeed, on my machine, this line takes 400 ms/request. Each request send a slow network query to the online ArcGIS API. Using multiprocessing will not help much as you will quickly reach additional limitations (limited rate of API request, saturation of the website). You need to send batch requests. Unfortunately this does not seems supported by the geopy
package. If you are tied to ArcGIS, you need to use their own API. You can find more information about how to do that on the ArcGIS documentation.
Upvotes: 1