HTML Parsing using Python

Question

I need to parse a webpage and extract some values from it, so I wrote a Python parser as follows:

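from HTMLParser import HTMLParser  # Python 2.7 module name

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Called for every run of text, including whitespace between tags
        print "Data     :", data

with open("result.html", "r") as f:
    s = f.read()

parser = MyHTMLParser()
parser.feed(s)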

The program reads an HTML file and prints the data from it.

I passed it the following result.html, and here the parser works fine:


Marks Obtained: 75.67 Out of 100


GATE Score: 911


All India Rank: 34

After passing the above html the output is:

Data :

Data : Marks Obtained:
Data : 75.67 Out of 100
Data :

Data :

Data :

Data : GATE Score:
Data : 911
Data :

Data :

Data :

Data : All India Rank:
Data : 34

But the parser is supposed to read a larger file, and the HTML shown above is only a small part of it. The file is too large to paste here, so I uploaded it at the following link: http://www.mediafire.com/?dsgr1gdjvs59c7c

When I pass the larger file, the parser doesn't read all the entries and leaves some of them blank in the output. Part of the output is shown below:

Data : Syllabi

Data :

Data : GATE Score

Data :

Data : GATE Results

Data :

Observe the blank entry on the line below GATE Score, which was 911 in the previous output.

The parser works fine with the small file but not with the large one. Why is this happening? I am using Python 2.7.

MattH · Accepted Answer

My preferred solution for parsing HTML or XML is lxml and xpath.

A quick and dirty example of how you might use xpath:

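from lxml import etree

# Read the whole file and parse it as HTML
with open('result.html', 'r') as f:
    doc = etree.HTML(f.read())

# For each matching row, print the text nodes of its td children
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print tr.xpath('./td/text()')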

Yields:

['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO        ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '
					']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']

This code builds an element tree from the HTML data. Using xpath, it selects every tr element that is a direct child of a table and has the attribute class="trmenu1". Then, for each of those rows, it selects and prints the text nodes of its td children.
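
If installing lxml is not an option, roughly the same row selection can be sketched with the standard library parser. This is only an illustration, not the method used above: it assumes Python 3 (where the module is html.parser rather than HTMLParser), the class name RowTextParser and the SAMPLE markup are made up for the example, and the real result.html is assumed to have rows shaped like these. Unlike ./td/text(), it joins all text inside a cell, including text in nested tags, and strips surrounding whitespace.

```python
from html.parser import HTMLParser  # Python 3; the Python 2.7 module is HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <td> inside <tr class="trmenu1"> rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per matching row
        self._row = None   # cells of the current matching row, else None
        self._cell = None  # text fragments of the current cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("class") == "trmenu1":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell of a matching row
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample; the real file is assumed to follow this structure.
SAMPLE = """
<table>
  <tr class="trmenu1"><td>Marks Obtained: </td><td>75.67 Out of 100</td></tr>
  <tr class="other"><td>ignored</td></tr>
  <tr class="trmenu1"><td>GATE Score: </td><td>911</td></tr>
  <tr class="trmenu1"><td>All India Rank: </td><td>34</td></tr>
</table>
"""

parser = RowTextParser()
parser.feed(SAMPLE)
parser.close()
for row in parser.rows:
    print(row)
```

Tracking the current row and cell explicitly is what the xpath expression does implicitly, which is why the lxml version is so much shorter.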
