gloopy

Reputation: 113

Passing lxml output to BeautifulSoup

My offline code works fine, but I'm having trouble passing a web page fetched with urllib through lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives good results on the specific pages we need to scrape), and then passing the result to BeautifulSoup.

#! /usr/bin/python
import urllib.request 
import urllib.error 
from io import StringIO
from bs4 import BeautifulSoup 
from lxml import etree 
from lxml import html 

file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly

With that working, I tried to feed it a page via urllib:

# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
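(For reference, req above is the request set up for basic authentication. I build it roughly like this; the URL, realm, and credentials here are placeholders rather than the real ones:)

# rough sketch of the basic-auth setup; URL and credentials are placeholders
# urllib.request is already imported at the top of the script
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/", "user", "password")
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)
urllib.request.install_opener(opener)  # urlopen(req) now goes through the auth handler
req = urllib.request.Request("http://example.com/page.html")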

Trying to deal with the error message, I tried:

# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file

I didn't know what to do about the OSError, so I looked for another method. I found suggestions to use lxml.html instead of lxml.etree, so the next attempt is:

# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer

This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, which is closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as in attempt 3)? I understand that both come down to datatypes, but I don't know what to do about them.

Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.

Upvotes: 1

Views: 2712

Answers (1)

Martijn Pieters

Reputation: 1121256

BeautifulSoup can use the lxml parser directly; there's no need to go to these lengths:

BeautifulSoup(doc, 'lxml')
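Fed straight from your urllib response, it would look something like this (req being the authenticated request you already build; BeautifulSoup will take the raw bytes and work out the encoding itself):

page = urllib.request.urlopen(req)
doc = page.read()                  # bytes straight from the response
soup = BeautifulSoup(doc, 'lxml')  # let BeautifulSoup drive the lxml parser

If you do still want an lxml tree for other reasons, lxml.etree.fromstring(doc, parser) and lxml.html.document_fromstring(doc) both accept the raw bytes directly, which sidesteps the StringIO/decode problems in attempts 1 and 2. To hand such a tree to BeautifulSoup you would have to serialize it back to a string with lxml.html.tostring() first, which is exactly the round trip the 'lxml' argument avoids.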

Upvotes: 3
