Parsing and extracting information from large HTML files with Python and lxml

Question

I would like to parse large HTML files and extract information from them with XPath. To do that, I'm using Python and lxml. However, lxml does not seem to work well with large files: it only parses files correctly when they are no larger than about 16 MB. The code that tries to extract information from the HTML through XPath is the following:

import lxml.html

tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")

The variable htmlCode contains the HTML code read from a file. I also tried the parse method, which reads the code from the file directly instead of taking it from a string, but that didn't work either. Since the contents are read from the file successfully, I think the problem is related to lxml. I've been looking for other libraries that parse HTML and support XPath, but lxml seems to be the main library used for that.
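A minimal sketch of that file-based variant, assuming lxml.html.parse is the parse method meant here. The names extract_texts and path are illustrative only and do not come from the original code:

```python
import lxml.html


def extract_texts(path):
    """Parse an HTML file from disk and return the matching text nodes.

    Hypothetical helper: the name and the `path` argument are illustrative.
    """
    # lxml.html.parse opens and reads the file itself, so the HTML never
    # has to be loaded into a Python string first.
    tree = lxml.html.parse(path)
    # Same XPath expression as in the string-based snippet.
    return tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")
```

Calling extract_texts("page.html") on a small file makes it easy to check whether the XPath itself matches before testing the large files.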

Is there another method/function of lxml that deals better with large HTML files?
