Reputation: 50567
I'm currently trying to iteratively parse a very large HTML document (I know... yuck) using lxml.etree.iterparse:
Incremental parser. Parses XML into a tree and generates tuples (event, element) in a SAX-like fashion.
I'm using an incremental/iterative/SAX approach to reduce memory usage; I don't want to load the HTML into a DOM/tree because the file is so large.
The problem I'm having is that I'm getting XML syntax errors such as:
lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59
This then causes everything to stop.
Is there a way to iteratively parse HTML without choking on syntax errors?
At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?
Edit:
This is what I'm currently doing:
import re
from lxml import etree

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while context:
    try:
        event, el = context.next()

        # do something

        # remove old elements to keep memory usage down
        while el.getprevious() is not None:
            del el.getparent()[0]
    except etree.XMLSyntaxError, e:
        print e.msg
        lineno = int(re.search(r'line (\d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        tfile = open(tfilename)
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print 'oops keyerror'
Upvotes: 8
Views: 3658
Reputation: 526
Sorry for rehashing an old question, but for latecomers searching for a solution: lxml version 3.3 has HTMLPullParser and XMLPullParser, which parse incrementally. You can also check out the lxml introduction to parsing for more examples.
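To illustrate (a minimal sketch, assuming lxml >= 3.3 is installed; the chunk strings are invented for the example): a pull parser accepts data in pieces via feed() and hands back (event, element) pairs through read_events(), so you never have to give it the whole document at once.

```python
# Feed the document in chunks and consume (event, element) pairs as
# they become available, instead of parsing the whole file in one go.
from lxml.etree import HTMLPullParser

parser = HTMLPullParser(events=('start', 'end'))
chunks = ['<html><body><p>first</p>', '<p>second</p></body></html>']
tags = []
for chunk in chunks:
    parser.feed(chunk)
    for event, element in parser.read_events():
        if event == 'end':
            tags.append(element.tag)
root = parser.close()  # returns the root element of the finished tree
# collect any events flushed by close()
for event, element in parser.read_events():
    if event == 'end':
        tags.append(element.tag)
```

With real input you would read the chunks from a file or socket; the same loop shape applies.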
If you want to parse a very large document and save memory, you can write a custom target class as an event handler to avoid building the element tree. Something like:
class MyParserTarget:
    def start(self, tag, attrib) -> None:
        pass  # do something

    def end(self, tag) -> None:
        pass  # do something

    def data(self, data) -> None:
        pass  # do something

    def close(self):
        pass  # return your result

mytarget = MyParserTarget()
parser = lxml.etree.HTMLPullParser(target=mytarget)
parser.feed(next(content))
# Do other stuff
result = parser.close()
If you continue to use etree.iterparse(..., html=True) as in the question, it will use HTMLPullParser under the hood, but iterparse will not pass through a custom target instance (like the one shown here), not even in the latest version of lxml. Therefore, if you prefer the custom-target approach (versus the events argument shown in the question), use HTMLPullParser directly.
Upvotes: 1
Reputation: 7822
lxml's etree.iterparse now supports the keyword argument recover=True, so instead of writing a custom HTMLParser subclass to fix broken HTML you can just pass this argument to iterparse.
To properly parse a huge, broken HTML document, all you need is:
etree.iterparse(tfile, events=('start', 'end'), html=True, recover=True)
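A minimal sketch of this in action (assuming a reasonably recent lxml), fed a small document containing the exact error from the question, a duplicated attribute:

```python
# With recover=True the parser keeps going instead of raising
# XMLSyntaxError on the redefined attribute.
from io import BytesIO
from lxml import etree

broken = b'<html><body><p id="a" id="a">text</p></body></html>'
tags = [el.tag for event, el in
        etree.iterparse(BytesIO(broken), events=('end',),
                        html=True, recover=True)]
```

Here tags comes back as ['p', 'body', 'html'], and the duplicate attribute is silently dropped.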
Upvotes: 6
Reputation: 50567
The perfect solution ended up being Python's very own HTMLParser [docs].
This is the (pretty bad) code I ended up using:
from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        self.finished = False
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.current_row = []
        self.current_cell = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_table:
            if tag == 'table':
                if ('id' in attrs) and (attrs['id'] == 'dgResult'):
                    self.in_table = True
        else:
            if tag == 'tr':
                self.in_row = True
            elif tag == 'td':
                self.in_cell = True
            elif (tag == 'a') and (len(self.current_row) == 7):
                url = attrs['href']
                self.current_cell = url

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.in_table:
                if self.in_row:
                    self.in_row = False
                    print self.current_row
                    self.current_row = []
        elif tag == 'td':
            if self.in_table:
                if self.in_cell:
                    self.in_cell = False
                    self.current_row.append(self.current_cell.strip())
                    self.current_cell = ''
        elif (tag == 'table') and self.in_table:
            self.finished = True

    def handle_data(self, data):
        if not len(self.current_row) == 7:
            if self.in_cell:
                self.current_cell += data
With that code I could then do this:
parser = MyParser()
for line in myfile:
parser.feed(line)
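For later readers on Python 3, the module was renamed to html.parser. A condensed, self-contained sketch of the same feed-as-you-read pattern (the CellCollector class and the sample rows are illustrative, not from the answer above):

```python
# Stream HTML line by line through Python 3's html.parser and collect
# the text of every <td> cell, without ever building a tree.
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

parser = CellCollector()
for line in ['<table><tr><td>a</td>', '<td>b</td></tr></table>']:
    parser.feed(line)
```

Because feed() accepts arbitrary partial input, a tag split across two lines is handled transparently.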
Upvotes: 12
Reputation: 559
Try parsing your HTML document with lxml.html:
Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
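A minimal illustration of this (assuming lxml is installed; note that lxml.html builds the whole tree in memory, so it only suits the question's use case if the document fits in RAM):

```python
# lxml.html.fromstring happily parses markup that would make an XML
# parser choke (duplicated attribute, unclosed <b> tag).
import lxml.html

doc = lxml.html.fromstring('<p id="a" id="a">hello <b>world</p>')
text = doc.text_content()
```

The parser repairs the markup and text_content() returns the element's text with tags stripped.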
Upvotes: 0