Drew Folz
Drew Folz

Reputation: 3

Pandas and Multiprocessing

I am using a FCC api to convert lat/long coordinates into block group codes:

import pandas as pd
import numpy as np
import urllib
import time
import json

# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

new_list = []

def block(x):
    for index,row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load json output in to a form python can understand
        a1 = json.loads(a)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])

#call the function with latlong as the argument.        
block(latlong)

#print the list, note: it is important that function appends to the list
print(new_list)

gives this output:

['360610031001021', '060372074001033', '170318391001104', '482011000003087', 
 '421010005001010', '040131141001032', '480291101002041', '060730053003011', 
 '481130204003064', '060855010004004', '484530011001092', '180973910003057', 
 '120310010001023', '060750201001001', '390490040001005', '371190001005000', 
 '484391233002071', '261635172001069', '481410029001001', '471570042001018']

The problem with this script is that I can only call the api once per row. It takes about 5 minutes per thousand for the script to run, which is not acceptable with 1,000,000+ entries I am planning on using this script with.

I want to use multiprocessing to parallel this function to decrease the time to run the function. I have tried to look in to the multiprocessing handbook, but have not been able to figure out how to run the function and append the output in to an empty list in parallel.

Just for reference: I am using python 3.6

Any guidance would be great!

Upvotes: 0

Views: 781

Answers (1)

mkastner
mkastner

Reputation: 113

You do not have to implement the parallelism yourself, there are libraries better than urllib, e.g. requests [0] and some spin-offs [1] which use either threads or futures. I guess you need to check yourself which one is the fastest.

Because of the small amount of dependencies I like the requests-futures best, here my implementation of your code using ten threads. The library would even support processes if you believe or figure out that it is somehow better in your case:

import pandas as pd
import numpy as np
import urllib
import time
import json
from concurrent.futures import ThreadPoolExecutor

from requests_futures.sessions import FuturesSession

#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        #request url and read the output
        url = getup+row['lat']+getup1+row['long']+getup2        
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        #load json output in to a form python can understand
        a1 = json.loads(request.result().content)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
    return new_list

#call the function with latlong as the argument.        
new_list = block(latlong)

#print the list, note: it is important that function appends to the list
print(new_list)

[0] http://docs.python-requests.org/en/master/

[1] https://github.com/kennethreitz/grequests

Upvotes: 1

Related Questions