santosh-patil

Reputation: 1550

HTML Parsing using Python

I need to parse a web page and extract some values from it, so I wrote a Python parser as follows:

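from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Called for every run of text between tags, including whitespace
        print "Data     :", data

# Read the whole file and close it when done
with open("result.html", "r") as f:
    s = f.read()

parser = MyHTMLParser()
parser.feed(s)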

The program reads an HTML file and prints the text data it contains.

When I pass it the following result.html, the parser works fine:

<tr class='trmenu1'>
<td>Marks Obtained: </td><td colspan=1>75.67 Out of 100</td>
</tr>
<tr class='trmenu1'>
<td>GATE Score: </td><td colspan=1>911</td>
</tr>
<tr class='trmenu1'>
<td>All India Rank: </td><td colspan=1>34</td>
</tr>

With the HTML above, the output is:

Data :

Data : Marks Obtained:
Data : 75.67 Out of 100
Data :

Data :

Data :

Data : GATE Score:
Data : 911
Data :

Data :

Data :

Data : All India Rank:
Data : 34

However, the parser is supposed to read a larger file, and the HTML above is only a small part of it. The file is too large to paste here, so I uploaded it at the following link: http://www.mediafire.com/?dsgr1gdjvs59c7c

When I pass the larger file, the parser does not read all the entries and leaves some of them blank in the output. Part of the output is shown below:

Data : Syllabi

Data :

Data : GATE Score

Data :

Data : GATE Results

Data :

Observe the blank entry on the line below GATE Score, which was 911 in the previous output.

The parser works fine with the small file but not with the large one. Why is this happening? I am using Python 2.7.

Upvotes: 3

Views: 7776

Answers (2)

MattH

Reputation: 38247

My preferred solution for parsing HTML or XML is lxml and xpath.

A quick and dirty example of how you might use xpath:

from lxml import etree
data = open('result.html','r').read()
doc = etree.HTML(data)

for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
  print tr.xpath('./td/text()')

Yields:

['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO        ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '\r\n\t\t\t\t\t']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']

This code builds an element tree from the HTML data. Using xpath, it selects every <tr> element that is a direct child of a <table> and has the attribute class="trmenu1". Then, for each such <tr>, it selects and prints the text of its <td> children.

Upvotes: 8

yann.kmm

Reputation: 837

If you look carefully at the HTML page on mediafire, you'll notice that it has two text blocks that contain "GATE Score":

 line 162: <tr><td class='qlink4' background='webimages/blkbuttona3.jpg' onMouseOut="background='webimages/blkbuttona3.jpg'" onMouseOver="background='webimages/blkbuttonb3.jpg'">&nbsp;<a class="dark2" href="gscore.php" title="GATE Score">GATE Score</a></td></tr>

 line 192: <tr class='trmenu1'><td>GATE Score: </td><td colspan=1>911</td></tr>

The problem you are having is probably due to an error in the full HTML page you are trying to parse. That would explain why you only see one "GATE Score" occurrence.
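
One way to check which of the two occurrences the parser actually reaches is to record the source line of each piece of text. The following is only a rough sketch, not a diagnosis of the actual file: it assumes Python 3 (where the module is named html.parser rather than HTMLParser), and the names LineTrackingParser and find_text, as well as the shortened sample markup, are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2.7 the module is HTMLParser


class LineTrackingParser(HTMLParser):
    """Collect non-blank text together with the source line it starts on."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (line_number, text) tuples

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only data between tags
            line_number, _offset = self.getpos()
            self.found.append((line_number, text))


def find_text(html, needle):
    """Return (line_number, text) pairs whose text contains `needle`."""
    parser = LineTrackingParser()
    parser.feed(html)
    parser.close()
    return [(line, text) for line, text in parser.found if needle in text]


# Hypothetical, shortened stand-in for the two rows quoted above
sample = (
    "<table>\n"
    "<tr><td><a href='gscore.php' title='GATE Score'>GATE Score</a></td></tr>\n"
    "<tr class='trmenu1'><td>GATE Score: </td><td colspan=1>911</td></tr>\n"
    "</table>\n"
)
for line, text in find_text(sample, "GATE Score"):
    print(line, repr(text))
```

If the second occurrence never shows up when this is run on the full page, the markup between the two reported line numbers is a reasonable place to start looking for the error.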

As was suggested to you in the comments, use BeautifulSoup, which is more tolerant of malformed HTML.
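
A minimal sketch of what that might look like, assuming the bs4 package (BeautifulSoup 4) is installed and Python 3 is used instead of the 2.7 from the question. The helper name extract_rows is made up for illustration, and the sample is the small snippet from the question, not the full page.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4


def extract_rows(html):
    """Return one list of stripped cell texts per <tr class="trmenu1">."""
    # "html.parser" selects the parser bundled with Python as the backend
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="trmenu1"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)
    return rows


sample = """
<tr class='trmenu1'>
<td>Marks Obtained: </td><td colspan=1>75.67 Out of 100</td>
</tr>
<tr class='trmenu1'>
<td>GATE Score: </td><td colspan=1>911</td>
</tr>
<tr class='trmenu1'>
<td>All India Rank: </td><td colspan=1>34</td>
</tr>
"""
for row in extract_rows(sample):
    print(row)
```

Selecting rows by their class attribute, rather than printing every text node, also avoids the blank "Data :" lines produced by the whitespace between tags.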

Upvotes: 2
