Reputation: 49
I'm new to Stack Overflow and this is my first question.
I'm writing a script in Python to parse an HTML page. The page looks like this:
<TABLE style="border: 1px solid black">
<TR>
<TD colspan="2"><span id="text1" style="color: white">DATA1</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename1" class="alsoname">DATA2</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename2" class="alsoname">DATA3</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename3" class="alsoname">DATA4</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename4" class="alsoname">DATA5</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename5" class="alsoname">DATA6</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename6" class="alsoname">DATA7</span></TD>
</TR>
<TR>
<TD class="rowLabel" valign="top">Data name</TD>
<TD valign="top" width="100"><span id="somename7" class="alsoname">DATA8</span></TD>
</TR>
I would like to collect the DATA values from inside the span tags, based on the span id. If the span id is somename1, then put its DATA value in a variable.
So far I have this code:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            for name, value in attrs:
                if name == 'id' and value == 'somename1':
                    print 'ID', value
                elif name == 'id' and value == 'somename2':
                    print 'ID', value
                elif name == 'id' and value == 'somename3':
                    print 'ID', value
                else:
                    print 'NO DATA'

p = MyHTMLParser()
p.feed(flush)
Can anybody help me?
Upvotes: 2
Views: 10830
Reputation: 35039
Overriding the handle_starttag method is not enough, because the text inside a tag is delivered separately to handle_data. Unfortunately, the basic HTMLParser is not very usable in my opinion; you may want to have a look at BeautifulSoup. With HTMLParser, you could do it like this:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Initialize the base parser first, otherwise feed() fails
        HTMLParser.__init__(self)
        self.collect_data = False
        self.tagname = None
        self.id = None

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            for name, value in attrs:
                if name == 'id' and value == 'somename1':
                    # Remember that the next text chunk belongs to this span
                    self.collect_data = True
                    self.tagname = tag
                    self.id = value

    def handle_data(self, data):
        if self.collect_data:
            self.somevar = data
            self.collect_data = False
            print "Tag: %s ID: %s" % (self.tagname, self.id)
            print "Data: %s" % data
With the collect_data flag we state that we want to put the next incoming data (in the handle_data method) into a variable. We turn this boolean on when the id is somename1 and turn it off once we have collected the data. Not really beautiful, is it?
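For anyone on Python 3, here is a rough sketch of the same flag idea, extended to collect several ids into a dictionary. It is not the code from the answer above: the class name SpanDataParser, the wanted_ids parameter, the values attribute and the SAMPLE_HTML string are all made up for illustration, and the sketch assumes the wanted spans are not nested inside each other.

```python
from html.parser import HTMLParser


class SpanDataParser(HTMLParser):
    """Collect the text of <span> elements whose id is in wanted_ids.

    Hypothetical helper for illustration; assumes wanted spans are not nested.
    """

    def __init__(self, wanted_ids):
        # The base initializer must run, otherwise feed() fails.
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current_id = None  # id of the span currently being read, if any
        self.values = {}        # maps span id -> collected text

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "span":
            span_id = dict(attrs).get("id")
            if span_id in self.wanted_ids:
                self.current_id = span_id
                self.values[span_id] = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.current_id is not None:
            self.values[self.current_id] += data

    def handle_endtag(self, tag):
        # Stop collecting at the closing </span>.
        if tag == "span":
            self.current_id = None


# Shortened stand-in for the page in the question (assumed input).
SAMPLE_HTML = """
<TABLE>
<TR><TD colspan="2"><span id="text1">DATA1</span></TD></TR>
<TR><TD class="rowLabel">Data name</TD>
<TD><span id="somename1" class="alsoname">DATA2</span></TD></TR>
<TR><TD class="rowLabel">Data name</TD>
<TD><span id="somename2" class="alsoname">DATA3</span></TD></TR>
</TABLE>
"""

parser = SpanDataParser(["somename1", "somename2"])
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.values)
```

After feeding the page, parser.values["somename1"] holds the text of that span, so each value can be assigned to its own variable if needed. Resetting the state in handle_endtag instead of after the first data chunk keeps the text intact when the parser splits it into several pieces.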
Upvotes: 0
Reputation: 39
I find that using BeautifulSoup with any sort of HTML is much easier.
from BeautifulSoup import BeautifulSoup as bs
from urllib2 import urlopen

data = urlopen('wherever').read()
soup = bs(data)
for span in soup.findAll('span'):
    print span['id'], span.text
You may have to refine some parts of it, since you only provided a table.
Upvotes: 2