Reputation: 12515
I'm scraping some JSON data from a website, and need to do this ~50,000 times (all data is for distinct zip codes over a 3-year period). I timed the program over about 1,000 calls; the average time per call was 0.25 seconds, which works out to roughly 3.5 hours of runtime for the whole range (all 50,000 calls).
How can I distribute this process across all of my cores? The core of my code is pretty much this:
with open("U:/dailyweather.txt", "r+") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writeData(zips, zip_weather_links, daypart)
Where writeData() looks like this:
def writeData(zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links:
            ## do some logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
                                                              var6, var7, var8, var9))
zips looks like this:
zips = ['55111', '56789', '68111', ...]
and zip_weather_links is a dictionary mapping each zip code to a list of (URL, date) tuples:
zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]
How can I distribute this using Pool or multiprocessing? Or would distributing the work even save time?
Upvotes: 0
Views: 241
Reputation: 46
You want to "Distribute web-scraping write-to-file to parallel processes in Python". As a start, let's look at where most of the time goes in web scraping.
The latency of HTTP requests is much higher than that of hard disks (link: Latency comparison). Small writes to a hard disk are significantly slower than larger ones; SSDs have much higher random-write speeds, so the effect matters less for them.
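Since you asked about Pool: the work here is bound by network latency rather than CPU, so a plain thread pool from the standard library should also do the job. A rough sketch (fetch_one() and urls are placeholder names, not from your code):
# Rough sketch: thread-based pool for network-bound requests.
# fetch_one() and urls are placeholder names.
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, backed by threads
import requests

def fetch_one(url):
    # One HTTP request; return the parsed JSON, or the status code on failure
    r = requests.get(url)
    if r.status_code != 200:
        return (url, r.status_code)
    return (url, r.json())

urls = ['http://xkcd.com/614/info.0.json',
        'http://xkcd.com/613/info.0.json']

with Pool(20) as pool:  # 20 concurrent requests
    results = pool.map(fetch_one, urls)
multiprocessing.dummy gives the familiar Pool interface backed by threads, which is usually enough for I/O-bound work like this and avoids the overhead of separate processes.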
Here is some example code using IPython parallel, which spreads the requests over a set of worker processes:
from ipyparallel import Client
import requests

rc = Client()
lview = rc.load_balanced_view()

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    import requests
    r = requests.get(w)
    if not r.status_code == 200:
        return (w, r.status_code,)
    return (w, r.json(),)

# get_webdata will be called once with every element of the worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list with all the return values
print(results[1])
# TODO: write the results to disk
You have to start the IPython parallel workers first:
(py35)River:~ rene$ ipcluster start -n 20
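For the TODO above: to avoid lots of small writes, collect all the results in memory first and write the file in a single pass. A rough sketch; write_results() and the nine-column layout are placeholders modelled on your writeData():
# Rough sketch: buffer all rows, then write them in one pass instead of many small writes.
def write_results(path, rows):
    # rows: iterable of 9-tuples, written as one tab-separated line each
    with open(path, "w") as f:
        f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
        f.writelines("\t".join(str(v) for v in row) + "\n" for row in rows)

# e.g. write_results("U:/dailyweather.txt", rows)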
Upvotes: 1