Reputation: 606
I have written some code which loops through a list of URLs, opens each one with urllib.request, and then parses it with BeautifulSoup. The only problem is that the list is quite long (about 5000 URLs) and the code runs successfully for about 200 of them before hanging indefinitely. Is there a way to either (a) skip to the next URL after a specific time, e.g. 30 seconds, or (b) reattempt to open the URL a set number of times before moving on to the next item?
from bs4 import BeautifulSoup
import csv
from urllib.request import Request, urlopen

# Read the list of URLs from the CSV file
with open('csv_file.csv', 'r') as f:
    reader = csv.reader(f)
    urls_list = list(reader)

for j in range(0, len(urls_list)):
    url = ''.join(urls_list[j])
    id = url[-10:].replace(".html", "")
    # Download the page and parse it
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    s = urlopen(req).read()
    soup = BeautifulSoup(s, "lxml")
Any suggestions much appreciated!
Upvotes: 0
Views: 1158
Reputation: 140266
The doc (for Python 2's urllib2, but urllib.request.urlopen in Python 3 accepts the same timeout argument) says:
The urllib2 module defines the following functions: urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]]) Open the URL url, which can be either a string or a Request object.
Adapt your code like this:
from urllib.error import URLError

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=10).read()  # give up after 10 seconds
except URLError as e:  # HTTPError is a subclass of URLError, so HTTP errors are caught too
    print(str(e))  # print error detail (this may not be a timeout after all!)
    continue  # skip to next element
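If you also want option (b) from the question, i.e. retrying a URL a few times before giving up, you can wrap the request in a small retry loop. Here is a minimal sketch; the fetch helper name, the retry count of 3 and the 10-second timeout are just illustrative choices, and it also catches socket.timeout because a timeout during the read is not always wrapped in URLError:

import socket
from urllib.error import URLError
from urllib.request import Request, urlopen

def fetch(url, retries=3, timeout=10):
    # Try the URL up to `retries` times; return the page bytes, or None if every attempt fails
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(retries):
        try:
            return urlopen(req, timeout=timeout).read()
        except (URLError, socket.timeout) as e:
            print("attempt %d on %s failed: %s" % (attempt + 1, url, e))
    return None

In your loop you would then call s = fetch(url), continue when it returns None, and pass s to BeautifulSoup as before.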
Upvotes: 1