Reputation: 3581
I have a Python script running that requests about 1000 URLs over HTTP and logs their responses. Here is the function that downloads the page at a URL.
import os
import urllib2
from urlparse import urlsplit

def downld_url(url, output):
    print "Entered downld_url and scraping the pdf/doc/docx file now..."
    global error
    try:
        # determine all extensions we should account for
        f = urllib2.urlopen(url)
        data = f.read()
        dlfn = urlsplit(url).path.split('.')[-1]
        print "The extension of the file is: " + str(dlfn)
        # write the download locally, push it to S3, then remove the local copy
        dwnladfn = ImageDestinationPath + "/" + output + "." + dlfn
        with open(dwnladfn, "wb") as code:
            code.write(data)
        _Save_image_to_s3(output + "." + dlfn, dwnladfn)
        print dlfn + " file saved to S3"
        os.remove(dwnladfn)
        print dlfn + " file removed from local folder"
        update_database(output, output + "." + dlfn, None)
        return
    except Exception as e:
        error = "download error: " + str(e)
        print "Error in downloading file: " + error
        return
This runs smoothly for the first 100-200 URLs in the pipeline, but after that the responses get very slow and eventually the requests just time out. I am guessing this is because of request overload. Is there some efficient way to do this without overloading the requests?
Upvotes: 0
Views: 296
Reputation: 9753
I don't know where the issue comes from, but if it is related to having too many requests in the same process, you could try multiprocessing as a workaround.
It may also speed up the whole thing, since you can do several tasks at the same time (for instance, one process downloading while another writes to disk, …). I did this for a similar task and it worked much better (it increased the total download speed too).
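For example, here is a minimal sketch of that idea using multiprocessing.Pool. It assumes the downld_url(url, output) function from your question is defined in the same script; the url_jobs list and the pool size of 8 are placeholders you would replace with your own values:

    from multiprocessing import Pool

    def worker(job):
        # each worker process handles one (url, output) pair
        url, output = job
        downld_url(url, output)  # the function from the question

    if __name__ == "__main__":
        # placeholder data: replace with your 1000 (url, output) pairs
        url_jobs = [("http://example.com/file1.pdf", "file1"),
                    ("http://example.com/file2.docx", "file2")]
        pool = Pool(processes=8)    # at most 8 downloads running at once
        pool.map(worker, url_jobs)  # blocks until every job has finished
        pool.close()
        pool.join()

Capping the pool size also means you never have all 1000 requests in flight at once, which should help with the overload you are seeing.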
Upvotes: 1