Reputation: 304
I would like to parse large HTML files and extract information from them with XPath, using Python and lxml. However, lxml does not seem to work well with large files: it only parses files correctly up to about 16 MB. The code that extracts information from the HTML through XPath is the following:
import lxml.html
tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")
The variable htmlCode contains the HTML read from a file. I also tried the parse method to read directly from the file instead of from a string, but that didn't work either. Since the file contents are read successfully, I think the problem lies in lxml. I've looked for other libraries that parse HTML and support XPath, but lxml appears to be the main one used for that.
Is there another lxml method or function that handles large HTML files better?
Upvotes: 3
Views: 2162
Reputation: 51
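If the file is very large, you can use iterparse with the html=True argument, which parses the file incrementally without any validation. iterparse does not evaluate an XPath expression over the whole document, so you need to write the matching conditions by hand, as the elem.tag check in fast_iter does here: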
from lxml import etree
import unicodedata

# Tag to process. This value comes from a MediaWiki XML dump; for HTML input,
# replace it with the tag you need (for example 'p').
TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'

counter = 0


def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only in the event and elem needed
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        # Free the element and its already-processed siblings to save memory.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def process_element(elem, fout):
    global counter
    # Strip accents, drop non-ASCII characters, and lowercase the text.
    normalized = unicodedata.normalize('NFKD', elem.text or '') \
        .encode('ascii', 'ignore').decode('ascii').lower()
    print(normalized.replace('\n', ' '), file=fout)
    if counter % 10000 == 0:
        print("Doc " + str(counter))
    counter += 1


def main():
    global counter
    counter = 0
    # Open the input in binary mode so lxml can handle the encoding itself.
    with open("large_file", 'rb') as fin, open('output.txt', 'w') as fout:
        context = etree.iterparse(fin, html=True)
        fast_iter(context, process_element, fout)


if __name__ == "__main__":
    main()
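For the XPath in the question, the hand-written condition could look roughly like the sketch below. It is only an illustration: the function name iter_item_texts is made up here, and it assumes that elements whose id contains 'item' are not nested inside one another. At each 'end' event the element's subtree is complete, so the rest of the expression can be run as a relative XPath on that element before it is cleared.

```python
import io
from lxml import etree


def iter_item_texts(source):
    """Yield the text nodes matched by
    //*[contains(@id, 'item')]/div/div[2]/p/text()
    while parsing incrementally (hypothetical helper, not part of lxml)."""
    for _event, elem in etree.iterparse(source, events=("end",), html=True):
        # Hand-written equivalent of the contains(@id, 'item') predicate.
        if "item" not in (elem.get("id") or ""):
            continue
        # The subtree is complete at the 'end' event, so evaluate the
        # remainder of the path relative to this element.
        for text in elem.xpath("./div/div[2]/p/text()"):
            yield str(text)
        # Release the processed element and its earlier siblings.
        # Assumes matching elements are not nested in each other.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


if __name__ == "__main__":
    sample = (
        b"<html><body>"
        b'<div id="item1"><div><div>skip</div><div><p>first</p></div></div></div>'
        b'<div id="other"><div><div>skip</div><div><p>ignored</p></div></div></div>'
        b"</body></html>"
    )
    for text in iter_item_texts(io.BytesIO(sample)):
        print(text)
```

Pass a file opened in binary mode (or a filename) instead of the BytesIO object to process a real file. Only matching elements are cleared here, so memory use still grows with the non-matching parts of the document.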
Upvotes: 2