stochastic_zeitgeist

Reputation: 1037

Parse specific links in html using HTMLParser in python?

I am trying to extract a particular set of links from an HTML file. Because HTMLParser is event-based and does not expose the document as a hierarchy tree, I cannot work out how to pick out only the links I need.

My HTML is as follows :

<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>

What I need is the value of every 'href' attribute that appears in a tag whose class is "bullet medialink". In other words, I want only those hrefs that belong to a tag of class 'bullet medialink'.

What I tried so far is

from HTMLParser import HTMLParser
import urllib

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if value == 'bullet medialink':
                    print "attr:", key

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Upvotes: 1

Views: 709

Answers (2)

Vincent Beltman

Reputation: 2104

I would use bs4 (Beautiful Soup 4) for this. bs4 is a third-party HTML parsing library. Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import urllib
from bs4 import BeautifulSoup

f = urllib.urlopen("sample.html")
html = f.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for atag in soup.select('.bullet.medialink'):  # Just enter a CSS selector here
    print atag['href']  # You can also get an attribute with atag.get('href')

Or shorter:

import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen("sample.html").read(), "html.parser")
for atag in soup.select('.bullet.medialink'):
    print atag['href']

Upvotes: 1

stochastic_zeitgeist

Reputation: 1037

So I finally did it with a simple boolean flag, since HTMLParser isn't a hierarchical (tree-building) parser.

Here's the code

from HTMLParser import HTMLParser
import urllib

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            flag = False  # becomes True when class="bullet medialink" is seen
            href = None
            for (key, value) in attrs:
                if key == 'class' and value == 'bullet medialink':
                    flag = True
                if key == 'href':
                    href = value
            # decide after the loop so the order of the attributes does not matter
            if flag and href is not None:
                print "link : ", href

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Hope someone comes up with a more elegant solution.
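
One possibly tidier variant, offered as a sketch rather than a definitive answer: turn the attribute list into a dict and compare the class tokens as a set. Assumptions here that are not in the original post: it is written for Python 3 (where the module is html.parser instead of HTMLParser), the names MediaLinkParser and links are made up for illustration, and the HTML is an inline sample rather than sample.html.

```python
# Python 3 sketch: the module is html.parser here, not HTMLParser.
from html.parser import HTMLParser


class MediaLinkParser(HTMLParser):  # hypothetical name, not from the original post
    """Collect href values of <a> tags that carry both wanted classes."""

    WANTED = {"bullet", "medialink"}

    def __init__(self):
        super().__init__()
        self.links = []  # hypothetical attribute holding the matches

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # lookup by name, so attribute order is irrelevant
        # The class value may be missing or None; split it into individual tokens.
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if self.WANTED <= classes and href is not None:
            self.links.append(href)


# Inline stand-in for sample.html (assumed content, modelled on the question).
sample = """
<p class="mediatitle">
    <a class="bullet medialink" href="link/to/a/file">Some Content</a>
    <a href="other/file" class="medialink bullet">Reordered</a>
    <a class="bullet" href="skip/me">Not a media link</a>
</p>
"""

parser = MediaLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

Collecting the links in a list instead of printing them inside the handler keeps the parser reusable, and the set comparison also accepts the two class names in either order or alongside extra classes.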

Upvotes: 0
