Reputation: 471
I want to extract tag from an html file in python without using BeautifulSoup. For example, I want to get
class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine
from
<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>
Any ideas?
Upvotes: 2
Views: 130
Reputation: 720
Have a look at this XML API provided in python, it explains how to access attributes , elements and has some HTML examples too. You can also generate parser objects.
Upvotes: 1
Reputation: 6430
For doing basic dom parsing, you can use the xml parser in the stl.
here is an example of turning xml into html using it (from the docs):
import xml.dom.minidom
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""
dom = xml.dom.minidom.parseString(document)
def getText(nodelist):
rc = []
for node in nodelist:
if node.nodeType == node.TEXT_NODE:
rc.append(node.data)
return ''.join(rc)
def handleSlideshow(slideshow):
print "<html>"
handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
slides = slideshow.getElementsByTagName("slide")
handleToc(slides)
handleSlides(slides)
print "</html>"
def handleSlides(slides):
for slide in slides:
handleSlide(slide)
def handleSlide(slide):
handleSlideTitle(slide.getElementsByTagName("title")[0])
handlePoints(slide.getElementsByTagName("point"))
def handleSlideshowTitle(title):
print "<title>%s</title>" % getText(title.childNodes)
def handleSlideTitle(title):
print "<h2>%s</h2>" % getText(title.childNodes)
def handlePoints(points):
print "<ul>"
for point in points:
handlePoint(point)
print "</ul>"
def handlePoint(point):
print "<li>%s</li>" % getText(point.childNodes)
def handleToc(slides):
for slide in slides:
title = slide.getElementsByTagName("title")[0]
print "<p>%s</p>" % getText(title.childNodes)
handleSlideshow(dom)
Upvotes: 1