user3725021

Reputation: 606

Python hangs when looping through long list of urls using urllib.request

I have written some code which loops through a list of urls, opens them using urllib.request and then parses them with BeautifulSoup. The only problem is that the list is quite long (about 5000 urls) and the code runs successfully for about 200 urls before hanging indefinitely. Is there a way to either (a) skip to the next url after a specific time, e.g. 30 seconds, or (b) retry opening the url a set number of times before moving on to the next item?

from bs4 import BeautifulSoup
import csv
from urllib.request import Request, urlopen

with open('csv_file.csv', 'r') as f:
    reader = csv.reader(f)
    urls_list = list(reader)
    for j in range(0, len(urls_list)):
        url = ''.join(urls_list[j])
        id = url[-10:].replace(".html", "")

        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        s = urlopen(req).read()   # no timeout here, so this call can hang forever
        soup = BeautifulSoup(s, "lxml")

Any suggestions much appreciated!

Upvotes: 0

Views: 1158

Answers (1)

Jean-François Fabre

Reputation: 140266

The doc (quoted here from Python 2's urllib2, but Python 3's urllib.request.urlopen takes the same timeout argument) says:

urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

Open the URL url, which can be either a string or a Request object.

Adapt your code like this:

import socket
from urllib.error import HTTPError, URLError

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=10).read()   # give up after 10 seconds
except (HTTPError, URLError, socket.timeout) as e:
    print(str(e))  # print error detail (this may not be a timeout after all!)
    continue       # skip to the next url in the loop
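
If you also want option (b) from the question (retry a set number of times before giving up), here is a minimal sketch of one way to wrap the same call in a retry loop. The helper name fetch, the attempt count and the one-second pause are arbitrary choices for illustration, not part of the original answer:

import socket
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def fetch(url, attempts=3, timeout=10):
    """Try to open url up to `attempts` times; return None if every attempt fails."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(attempts):
        try:
            return urlopen(req, timeout=timeout).read()
        except (HTTPError, URLError, socket.timeout) as e:
            print("attempt %d failed for %s: %s" % (attempt + 1, url, e))
            time.sleep(1)   # brief pause before retrying
    return None

In your loop you would then call s = fetch(url) and continue to the next url whenever it returns None.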

Upvotes: 1
