Reputation: 113
My offline code works fine but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication then lxml to parse (it gives a good result with the specific pages we need to scrape) then to BeautifulSoup.
#! /usr/bin/python
import urllib.request
import urllib.error
from io import StringIO
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly
With that working, I tried to feed it a page via urllib:
# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
Trying to deal with the error message, I tried:
# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file
I didn't know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:
attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer
This clearly gives a structure of some sort, but how to pass it to BeautifulSoup? My question is twofold: How can I pass a page from urllib to lxml.etree (as in attampt 1, closest to my working code)? or, How can I pass a lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes but don't know what to do about them.
python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to python. Thanks to the internet for code fragments and examples.
Upvotes: 1
Views: 2712
Reputation: 1121256
BeautifulSoup can use the lxml
parser directly, no need to go to these lengths.
BeautifulSoup(doc, 'lxml')
Upvotes: 3