Vinay

Reputation: 470

Extract data using HTMLParser

<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>

I am using Python's HTMLParser module to extract the values Sam and John from the HTML snippet above, but the handle_data method captures only Sam and not John.

How can I get both Sam and John?

Upvotes: 2

Views: 1736

Answers (1)

alecxe

Reputation: 473863

Keep an instance-level boolean flag: set it to True when a p tag starts and to False when it ends. In the handle_data() method, use the data only while the flag is True:

# Python 2 module name; in Python 3 the class lives in html.parser
from HTMLParser import HTMLParser

data = """
<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>
"""

class Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.recording = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            print(data)  # parenthesized form works in Python 2 and 3

parser = Parser()
parser.feed(data)

Prints:

Sam
John
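
If you are on Python 3, the same flag-based idea might look like the sketch below. It is only an illustration: the class name CollectingParser and the names list are my own choices, not part of the original code, and it collects the values into a list instead of printing them so they can be reused afterwards.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser

data = """
<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>
"""

class CollectingParser(HTMLParser):  # hypothetical name, for illustration
    def __init__(self):
        super().__init__()
        self.recording = False  # True only while inside a <p> element
        self.names = []         # text chunks seen while recording

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.recording = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.recording = False

    def handle_data(self, data):
        # <br/> splits the text, so handle_data runs once per chunk
        text = data.strip()
        if self.recording and text:
            self.names.append(text)

parser = CollectingParser()
parser.feed(data)
parser.close()
print(parser.names)
```

Each br tag ends the current text chunk, so handle_data() is called separately for Sam and for John, and both end up in parser.names.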

Upvotes: 4
