Reputation: 827
I'm trying to parse a large HTML page with malformed table markup. There are around 7000-10000 rows in the table. The problem is that none of the tr, th, or td tags are closed. So the markup looks like this:
<HTML>
<HEAD>
</HEAD>
<BODY>
<center>
<table border = 1>
<tr height=40><th colspan = 16><font size=4>Dummy content
<tr><th>A
<th>B
<th>C
<th>D
<th>E
<th>F
<th>G
<tr><td>A
<td>B
<td>C
<td>D
<td>E
<tr><td>A
<td>B
<td>C
<td>D
<td>E
.........
.........
</table>
</center>
</BODY>
</HTML>
I tried BeautifulSoup.prettify() to fix it, but BeautifulSoup runs into a maximum recursion depth error. I also tried lxml, as follows:
from lxml import html
root = html.fromstring(htmltext)
print len(root.find('.//tr'))
But it returns a length of around 50, while there are actually more than 7000 tr's.
Is there a good way to parse the HTML and extract content for each row?
Upvotes: 0
Views: 387
Reputation: 773
I'd suggest trying the HTMLParser module. I just wrote some code that uses it, and I couldn't test my "except HTMLParser.HTMLParseError" block because I couldn't devise input that would make the parser fail!
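A minimal sketch of what that could look like (the class name and the sample markup below are my own, not from the answer; the module is HTMLParser in Python 2 and html.parser in Python 3). The key point is that the parser fires handle_starttag for every <tr>/<th>/<td> it sees, so the missing closing tags don't matter:

```python
from html.parser import HTMLParser  # named HTMLParser in Python 2


class RowExtractor(HTMLParser):
    """Collects cell text per row, relying only on start tags."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self.in_cell = False  # True while we are inside a td/th

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])   # a new <tr> starts a new row
            self.in_cell = False
        elif tag in ('td', 'th'):
            self.rows[-1].append('')  # a new cell in the current row
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.rows[-1][-1] += data.strip()


parser = RowExtractor()
parser.feed("<table><tr><td>A<td>B<tr><td>C<td>D</table>")
print(parser.rows)  # [['A', 'B'], ['C', 'D']]
```

Because each new <tr> or <td> start tag implicitly ends the previous one, this handles the unclosed markup in the question without any repair step.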
Upvotes: 1
Reputation: 10631
I hope you are looking for something like this:
import re
trs = re.findall(r'(?<=<tr>).*?(?=<tr>)', your_string, re.DOTALL)
print trs
This regex will return everything between two consecutive <tr> tags. If you want to search between two other tags, just change both occurrences of tr to the tag you need.
I ran a little test and it worked for me; let me know if it helped you.
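Note that the lookbehind pattern above only matches literal <tr> tags, so it would skip rows that carry attributes (like the <tr height=40> in the question) and drop the last row before </table>. One possible adaptation (this variant and the sample string are mine, not part of the original answer) matches each <tr ...> up to the next row or the end of the table, then splits the row body into cells:

```python
import re

sample = """<table border = 1>
<tr height=40><th colspan = 16>Dummy content
<tr><th>A
<th>B
<tr><td>1
<td>2
</table>"""

# Capture each row body: from a <tr ...> start tag (attributes allowed)
# up to, but not including, the next <tr ...> or the closing </table>.
rows = re.findall(r'<tr[^>]*>(.*?)(?=<tr[^>]*>|</table>)', sample, re.DOTALL)

# Split each row body into cells on <td ...>/<th ...> start tags;
# the first split chunk is the text before the first cell, so drop it.
cells = [[c.strip() for c in re.split(r'<t[dh][^>]*>', row)[1:]]
         for row in rows]
print(cells)  # [['Dummy content'], ['A', 'B'], ['1', '2']]
```

For 7000+ rows of truly irregular markup a real parser is usually more robust than a regex, but this works when the tag structure is as uniform as in the question.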
Upvotes: 1