Reputation: 18197
The code below works for many web pages, but for a few, like the one below, it gives an error:
Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource
Python to reproduce:
from lxml.html import parse
import requests
import pprint

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    parsed_page = parse(page_url)
    dom = parsed_page.getroot()
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)

print("Try get with requests' default User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with the User-Agent header suppressed")
# passing None as a header value makes requests omit that header entirely
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)
This post refers to adding a User-Agent header, but I don't understand how to do that with lxml. Both requests.get calls above run without error and return HTTP status 200.
If I have to use requests.get, I can do that, but then how do I get the result into the dom object?
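To make the question concrete, this is roughly what I'm hoping for. It is an untested sketch: lxml.html.fromstring and the browser-style User-Agent string are just my guesses, not something I found in the lxml docs.

from lxml.html import fromstring
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# Sketch: fetch the page myself with an explicit User-Agent (value is just
# an example) and hand the downloaded bytes to lxml, instead of letting
# lxml do the HTTP request itself.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/0.1)'}
response = requests.get(page_url, headers=headers)
response.raise_for_status()

dom = fromstring(response.content)  # would this give me the same dom object?
print(dom.tag)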
Upvotes: -1
Views: 25
Reputation: 18197
The following seems to work; I just don't understand why the extra steps are necessary. If anyone could explain, that would be appreciated.
from lxml.html import parse  # only needed for the commented-out old way
from lxml import etree
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # old way of doing it
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()
    # so the goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)

    # parse the downloaded bytes ourselves instead of letting lxml fetch the URL
    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)

    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)
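My best guess at the why: parse(url) makes lxml (libxml2) perform the HTTP request itself, with its own minimal client and headers, and this particular server appears to reject that request; fetching the page with requests and parsing the downloaded bytes avoids lxml's fetch entirely. If that is right, the steps can also be shortened with lxml.html.fromstring, which parses the bytes in one call and returns an element that supports cssselect directly. A sketch of the same idea, not verified against this site:

from lxml import html
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

response = requests.get(page_url, headers={'User-Agent': None})
dom = html.fromstring(response.content)  # HtmlElement; cssselect works on it

for link in dom.cssselect('a'):
    print("Link:", link.text)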
Upvotes: 0