NealWalters

Reputation: 18197

python lxml.html.parse not reading url - or how to get a requests.get result into an lxml dom?

The same code below works for many webpages, but for a few, like the one below, it gives this error:

Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource

Python to reproduce:

from lxml.html import parse
import requests
import pprint 

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # lxml fetches the URL itself here, using libxml2's own HTTP client
    parsed_page = parse(page_url)

    dom = parsed_page.getroot()

except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)


print("Try get without User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with User-Agent")
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)

This post, "python lxml.html.parse not reading url", refers to adding a User-Agent, but I don't understand how to do that with lxml. Both of the requests.get calls above run with no error and return HTTP status 200.

If I have to use requests.get, I can do that, but then how do I get the response into the dom object?
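
Based on that post, my best guess is to download the page myself with an explicit User-Agent header and hand parse a file-like object instead of the URL. A rough sketch of what I mean (the Mozilla/5.0 string is just a placeholder browser user agent, and I haven't verified this against this particular site):

import urllib.request
from lxml.html import parse

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# fetch the page ourselves so we control the User-Agent header,
# then let lxml parse the open file-like response object
req = urllib.request.Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as response:
    dom = parse(response).getroot()

print(dom.tag)  # should print 'html' if parsing succeeded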

Upvotes: -1

Views: 25

Answers (1)

NealWalters

Reputation: 18197

The following seems to work; I just don't understand why the extra steps are necessary. If anyone can explain, that would be appreciated.

from lxml import etree
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # old way of doing it - fails with "failed to load HTTP resource" for this site
    # from lxml.html import parse
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()

    # so the goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    # setting 'User-Agent' to None tells requests not to send that header at all
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)

    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)

    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e}"
    print(errMsg)
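
My best guess at the "why": parse(page_url) makes lxml (libxml2) fetch the URL itself, and that built-in HTTP fetch offers no obvious way to change request headers, so when this server rejects its request there is nothing to adjust. Downloading with requests and parsing result.content just separates the download step from the parse step. A slightly shorter variant of the same idea (a sketch, not a different technique) uses lxml.html.fromstring, which skips the explicit HTMLParser and returns an HtmlElement that supports cssselect the same way:

import requests
import lxml.html

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

result = requests.get(page_url, headers={'User-Agent': None})
result.raise_for_status()  # fail early if the download itself went wrong

# fromstring parses the already-downloaded bytes,
# so lxml never has to make its own HTTP request
dom = lxml.html.fromstring(result.content)
for link in dom.cssselect('a'):
    print("Link:", link.text)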

Upvotes: 0
