Reputation: 12515
I'm scraping some JSON data from a website, and need to do this ~50,000 times (all data is for distinct zip codes over a 3-year period). I timed the program over about 1,000 calls; the average time per call was 0.25 seconds, which works out to roughly 3.5 hours of runtime for the whole range (all 50,000 calls).
How can I distribute this process across all of my cores? The core of my code is pretty much this:
with open("U:/dailyweather.txt", "r+") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writeData(zips, zip_weather_links, daypart)
Where writeData() looks like this:
def writeData(zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links:
            ## do some logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
                                                              var6, var7, var8, var9))
zips looks like this:
zips = ['55111', '56789', '68111', ...]
and zip_weather_links is a dictionary mapping each zip code to a list of (URL, date) tuples:
zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]
How can I distribute this using Pool or multiprocessing? Or would distributing the work even save time?
Upvotes: 0
Views: 241
Reputation: 46
You want to "Distribute web-scraping write-to-file to parallel processes in Python". As a start, let's look at where most of the time goes in web scraping.
The latency of HTTP requests is much higher than that of hard disks (link: Latency comparison). Small writes to a hard disk are significantly slower than larger ones; SSDs have much higher random-write speeds, so the effect matters less for them.
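Since you asked about Pool: the work here is bound by network latency rather than CPU, so a plain thread pool from the standard library should also do the job. A rough sketch (fetch_one() and urls are placeholder names, not from your code):
# Rough sketch: thread-based pool for network-bound requests.
# fetch_one() and urls are placeholder names.
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, backed by threads
import requests

def fetch_one(url):
    # One HTTP request; return the parsed JSON, or the status code on failure
    r = requests.get(url)
    if r.status_code != 200:
        return (url, r.status_code)
    return (url, r.json())

urls = ['http://xkcd.com/614/info.0.json',
        'http://xkcd.com/613/info.0.json']

with Pool(20) as pool:  # 20 concurrent requests
    results = pool.map(fetch_one, urls)
multiprocessing.dummy gives the familiar Pool interface backed by threads, which is usually enough for I/O-bound work like this and avoids the overhead of separate processes.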
Here is some example code using IPython parallel, which spreads the requests over a set of worker processes:
from ipyparallel import Client
import requests

rc = Client()
lview = rc.load_balanced_view()

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    import requests
    r = requests.get(w)
    if not r.status_code == 200:
        return (w, r.status_code,)
    return (w, r.json(),)

# get_webdata will be called once with every element of the worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list with all the return values
print(results[1])
# TODO: write the results to disk
You have to start the IPython parallel workers first:
(py35)River:~ rene$ ipcluster start -n 20
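For the TODO above: to avoid lots of small writes, collect all the results in memory first and write the file in a single pass. A rough sketch; write_results() and the nine-column layout are placeholders modelled on your writeData():
# Rough sketch: buffer all rows, then write them in one pass instead of many small writes.
def write_results(path, rows):
    # rows: iterable of 9-tuples, written as one tab-separated line each
    with open(path, "w") as f:
        f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
        f.writelines("\t".join(str(v) for v in row) + "\n" for row in rows)

# e.g. write_results("U:/dailyweather.txt", rows)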
Upvotes: 1