Reputation: 18197
The code below works for many web pages, but for a few, like the one below, it gives an error:
Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource
Python to reproduce:
from lxml.html import parse
import requests
import pprint

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    parsed_page = parse(page_url)
    dom = parsed_page.getroot()
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)

print("Try get with requests' default User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with the User-Agent header suppressed")
# passing None as a header value makes requests omit that header entirely
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)
This post refers to adding a User-Agent header, but I don't understand how to do that with lxml. Both requests.get calls above run without error and return HTTP status 200.
If I have to use requests.get, I can do that, but then how do I get the result into the dom object?
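To make the question concrete, this is roughly what I'm hoping for. It is an untested sketch: lxml.html.fromstring and the browser-style User-Agent string are just my guesses, not something I found in the lxml docs.

from lxml.html import fromstring
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# Sketch: fetch the page myself with an explicit User-Agent (value is just
# an example) and hand the downloaded bytes to lxml, instead of letting
# lxml do the HTTP request itself.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/0.1)'}
response = requests.get(page_url, headers=headers)
response.raise_for_status()

dom = fromstring(response.content)  # would this give me the same dom object?
print(dom.tag)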
Upvotes: -1
Views: 25
Reputation: 18197
The following seems to work; I just don't understand why the extra steps are necessary. If anyone could explain, that would be appreciated.
from lxml.html import parse  # only needed for the commented-out old way
from lxml import etree
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # old way of doing it
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()
    # so the goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)

    # parse the downloaded bytes ourselves instead of letting lxml fetch the URL
    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)

    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)
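My best guess at the why: parse(url) makes lxml (libxml2) perform the HTTP request itself, with its own minimal client and headers, and this particular server appears to reject that request; fetching the page with requests and parsing the downloaded bytes avoids lxml's fetch entirely. If that is right, the steps can also be shortened with lxml.html.fromstring, which parses the bytes in one call and returns an element that supports cssselect directly. A sketch of the same idea, not verified against this site:

from lxml import html
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

response = requests.get(page_url, headers={'User-Agent': None})
dom = html.fromstring(response.content)  # HtmlElement; cssselect works on it

for link in dom.cssselect('a'):
    print("Link:", link.text)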
Upvotes: 0