Lucas Ou-Yang

Reputation: 5655

Python requests multithreading "Max Retries exceeded with url" Caused by <class 'socket.gaierror'>

I'm trying to concurrently download a bunch of URLs with both the requests module and Python's built-in multiprocessing library. When using the two together, I'm experiencing some errors which definitely do not look right. I sent out 100 requests with 100 threads, and usually about 50 of them succeed while the other 50 receive this message:

   HTTPConnectionPool(host='www.reuters.com', port=80): Max retries exceeded with url: 
/video/2013/10/07/breakingviews-batistas-costly-bluster?videoId=274054858&feedType=VideoRSS&feedName=Business&videoChannel=5&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+reuters%2FUSVideoBusiness+%28Video+%2F+US+%2F+Business%29 (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)

Neither the "Max retries exceeded" error nor the "nodename nor servname provided" error looks right.

Here is my requests setup:

import requests

req_kwargs = {
  'headers' : {'User-Agent': 'np/0.0.1'},
  'timeout' : 7,
  'allow_redirects' : True
}

# I left out the multiprocessing code but that part isn't important
resp = requests.get(some_url, **req_kwargs)

Does anyone know how to prevent or at least move further in debugging this?

Thank you.

Upvotes: 4

Views: 6025

Answers (2)

Napo Mokoetle

Reputation: 1

[Errno 8] nodename nor servname provided, or not known

This simply implies that the machine can't resolve www.reuters.com. Either add the hostname-to-IP mapping to your hosts file, or fix DNS resolution for the domain.
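A quick way to confirm whether name resolution is the problem is to call the resolver directly, outside of requests. This is just a diagnostic sketch: if `socket.getaddrinfo` raises `socket.gaierror` here as well, the failure is in DNS/the resolver, not in requests itself.

```python
import socket

# Diagnostic sketch: try to resolve the host that requests failed on.
# socket.gaierror here means DNS resolution is failing independently
# of requests (e.g. resolver overload under heavy concurrency).
try:
    infos = socket.getaddrinfo("www.reuters.com", 80)
    # Collect the distinct IP addresses returned by the resolver.
    print("resolved to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as exc:
    print("DNS lookup failed:", exc)
```

Running this repeatedly from many threads at once can also reproduce the error if your resolver is the bottleneck.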

Upvotes: 0

flyer

Reputation: 9806

I think it may be caused by a visit frequency higher than the site allows.

Try the following:

  • Crawl the site at a lower frequency; if you receive the same error again, visit the site in a web browser to see whether your spider has been blocked by the site.
  • Use a proxy pool to crawl the site, so the site doesn't deem your visit frequency too high and block your spider.
  • Enrich your HTTP request headers so the requests look like they were emitted by a web browser.
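The first and third suggestions can be sketched together: reuse a single `requests.Session`, send browser-like headers, and pace the requests instead of firing 100 at once. The specific header values and the one-second delay below are illustrative assumptions, not tuned recommendations.

```python
import time
import requests

# Sketch: one shared Session with browser-like headers (values are
# illustrative) instead of a bare User-Agent like 'np/0.0.1'.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
})

def fetch(url):
    # Same kwargs as in the question, but paced with a crude throttle
    # so the target site sees a lower visit frequency.
    resp = session.get(url, timeout=7, allow_redirects=True)
    time.sleep(1)  # illustrative delay between requests
    return resp
```

A Session also reuses TCP connections, which reduces the number of DNS lookups and can itself help avoid the `gaierror` under load.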

Upvotes: 2

Related Questions