onxx
onxx

Reputation: 1435

Multithreading issues with Pandas

I have a very large excel file of 1000+ street intersections that I need to find the Longitude and latitudes for and then write that info to file/list for a different program to consume.

What I'm stuck on is on how to build a more efficient script using multithreading/multiprocessing, I have looked through other questions/post but I'm i find it all a bit confusing. The code below takes roughly ~ 10+ mins. Any help would be great.

from geopy.geocoders import ArcGIS
import pandas
from datetime import datetime


start_time = datetime.now()
def my_LatLong(address):
    n = ArcGIS().geocode(address)
    if n != None: 
        return n
    else:
        return [address,"None"]


df = pandas.read_excel("street_sample.xlsx",sheet_name=0)
count = 0
street_list = df.loc[:,"Description"]

new_list =[]
for i in list(street_list):
    location = my_LatLong(f"{i.split('-')[0]}, Vancouver, Canada")
    if location != None:
        new_list.append([f"{len(street_list)}, {list(location)}"])
        print(f"{count}/{len(street_list)} - {i} = Completed\t\t\t\t" ,end='\r')
    else:
        print(f"{count}/{len(street_list)} - {i} == None value \t\t\t",end='\r')

    count += 1

# doing something with new_list

endtime_time = datetime.now()

print (f"Program ran for:  {endtime_time -start_time}")

street_sample.xlsx

ID Description
12501x 1900 W Georgia - ONAT_STREET:
12501x 4th/6th Diversion & 6th Ped - ONAT_STREET:
12501x 4th/6th Diversion & 6th Semi - ONAT_STREET:
12501x Abbott & Cordova - ONAT_STREET:
12501x Abbott & Expo - ONAT_STREET:
12501x Abbott & Hastings - ONAT_STREET:
12501x Abbott & Keefer - ONAT_STREET:
12501x Aberdeen & Kingsway - ONAT_STREET:
12501x Alberta & 49th - ONAT_STREET: 70175
12501x Alder & 12th - ONAT_STREET:
12501x Alder & 6th - ONAT_STREET:
12501x Alder & Broadway - ONAT_STREET:
12501x Alexandra & King Edward - ONAT_STREET:
12501x Alma & 10th - ONAT_STREET:
12501x Alma & 4th - ONAT_STREET:
12501x Alma & 6th - ONAT_STREET:
12501x Alma & Broadway - ONAT_STREET:
12501x Alma & Point Grey Road - ONAT_STREET:
12501x Anderson & 2nd / Lamey's Mill - ONAT_STREET:
12501x Anderson (Granville) & 4th - ONAT_STREET:
12501x Angus & 41st - ONAT_STREET:
12501x Angus & Marine - ONAT_STREET:
12501x Arbutus & 10th - ONAT_STREET:
12501x Arbutus & 11th - ONAT_STREET:
12501x Arbutus & 12th - ONAT_STREET:
12501x Arbutus & 16th - ONAT_STREET:
12501x Arbutus & 20th - ONAT_STREET:
12501x Arbutus & 33rd - ONAT_STREET:
12501x Arbutus & 4th - ONAT_STREET:
12501x Arbutus & 8th - ONAT_STREET:
12501x Arbutus & Broadway - ONAT_STREET:
12501x Arbutus & Cornwall - ONAT_STREET:
12501x Arbutus & King Edward - ONAT_STREET:
12501x Arbutus & Lahb - ONAT_STREET:

Upvotes: 0

Views: 136

Answers (1)

Jérôme Richard
Jérôme Richard

Reputation: 50278

The problem does not comes from Pandas but ArcGIS().geocode(address) which is insanely slow. Indeed, on my machine, this line takes 400 ms/request. Each request send a slow network query to the online ArcGIS API. Using multiprocessing will not help much as you will quickly reach additional limitations (limited rate of API request, saturation of the website). You need to send batch requests. Unfortunately this does not seems supported by the geopy package. If you are tied to ArcGIS, you need to use their own API. You can find more information about how to do that on the ArcGIS documentation.

Upvotes: 1

Related Questions