Reputation: 527
I'm attempting to download a few thousand images using Python with the multiprocessing and requests libraries. Things start off fine, but about 100 images in, everything locks up and I have to kill the processes. I'm using Python 2.7.6. Here's the code:
import re  # used in get_domain_name; missing from the original
import requests
import shutil
from multiprocessing import Pool
from urlparse import urlparse

def get_domain_name(s):
    domain_name = urlparse(s).netloc
    new_s = re.sub(':', '_', domain_name)  # replace colons, which are invalid in filenames
    return new_s

def grab_image(url):
    response = requests.get(url, stream=True, timeout=2)
    if response.status_code == 200:
        img_name = get_domain_name(url)
        # IMG_DST is a destination directory path defined elsewhere
        with open(IMG_DST + img_name + ".jpg", 'wb') as outf:
            shutil.copyfileobj(response.raw, outf)
    del response

def main():
    # list_of_image_urls is a filename defined elsewhere
    with open(list_of_image_urls, 'r') as f:
        urls = f.read().splitlines()
    urls.sort()
    pool = Pool(processes=4, maxtasksperchild=2)
    pool.map(grab_image, urls)
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
Edit: After changing the multiprocessing import to multiprocessing.dummy, to use threads instead of processes, I was still experiencing the same problem. It turns out I'm sometimes hitting a Motion JPEG stream instead of a single image, which explains the hang: the timeout in requests applies to each individual socket read, not to the download as a whole, so a never-ending stream never trips it. To deal with this I'm using a context manager, and I created a FileTooBigException that aborts any download that exceeds a size limit. While I haven't implemented a check that what I downloaded is actually an image file (a sketch of one possible check follows the code), or some other housekeeping, I thought the code below might be useful for someone:
import os
import socket
from contextlib import closing  # needed for closing(); missing from the original

class FileTooBigException(requests.exceptions.RequestException):
    """File over LIMIT_SIZE"""

def grab_image(url):
    try:
        img = ''
        with closing(requests.get(url, stream=True, timeout=4)) as response:
            if response.status_code == 200:
                content_length = 0
                img_name = get_domain_name(url)
                img = IMG_DST + img_name + ".jpg"
                with open(img, 'wb') as outf:
                    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                        outf.write(chunk)
                        content_length = content_length + CHUNK_SIZE
                        # CHUNK_SIZE and LIMIT_SIZE are constants defined elsewhere
                        if content_length > LIMIT_SIZE:
                            raise FileTooBigException(response)
    except requests.exceptions.Timeout:
        pass
    except requests.exceptions.ConnectionError:
        pass
    except socket.timeout:
        pass
    except FileTooBigException:
        os.remove(img)  # discard the oversized partial download
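For the image check I mentioned but haven't implemented, a minimal sketch might look something like this (looks_like_jpeg is just a placeholder name, and it only covers JPEG, since that's the only extension the script writes):

def looks_like_jpeg(response, path):
    # Cheap first pass: trust the Content-Type header when the server sends one.
    content_type = response.headers.get('Content-Type', '')
    if content_type and not content_type.startswith('image/jpeg'):
        return False
    # Second pass: check the magic number. JPEG files always begin with
    # the two-byte Start Of Image marker, FF D8.
    with open(path, 'rb') as f:
        return f.read(2) == b'\xff\xd8'

Calling this after the download completes and doing os.remove(img) when it returns False would discard HTML error pages or other non-JPEG content saved under a .jpg name.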
And any suggested improvements are welcome!
Upvotes: 0
Views: 1705
Reputation: 3405
There is no point in using multiprocessing for I/O concurrency. In network I/O the thread involved just waits most of the time, doing nothing, and Python threads are excellent at doing nothing. So use a thread pool instead of a process pool. Each process consumes a lot of resources and is unnecessary for I/O-bound activities, while threads share the process state and are exactly what you are looking for.
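The change is minimal, because multiprocessing.dummy exposes the same Pool API backed by threads. A rough sketch (the worker function and URL list here are just stand-ins for yours):

from multiprocessing.dummy import Pool  # same interface as multiprocessing.Pool, but thread-based

import requests

def grab_image(url):
    # Stand-in worker; in practice this is your download function.
    return requests.get(url, timeout=4).status_code

urls = ['http://example.com/a.jpg', 'http://example.com/b.jpg']  # stand-in list

pool = Pool(20)  # threads are cheap, so the pool can be much larger than the CPU count
statuses = pool.map(grab_image, urls)
pool.close()
pool.join()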
Upvotes: 1