Reputation: 871
I am using python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of that custom 404 error page (the page where it says something like "Page is not found.")
Current code:
try:
req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
client = urlopen(req)
#downloading html data
page_html = client.read()
#closing connection
client.close()
except:
print("The following URL was not found. Program terminated.\n" + URL)
break
Upvotes: 0
Views: 1508
Reputation: 263
To preserve the comment that also answers the question, and also because it's what I was looking for, a way to do this without going outside urllib:
By t.m.adam at Nov 4, 2018 at 10:07
See HTTPError. It has a .read() method which returns the response content. –
Upvotes: 0
Reputation: 625
Have you tried the requests
library?
Just install the library with pip
pip install requests
And use it like this
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
Upvotes: 2