Praful Bagai

Reputation: 17382

How to parse HTML string into HTML DOM elements in python?

I have a string of HTML elements:

HTMLstr = """
    <div class='column span4 ui-sortable' id='column1'></div>
    <div class='column span4 ui-sortable' id='column2'>
        <div class='portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all' id='widget_basicLine'>
        <div class='portlet-header ui-widget-header ui-corner-all'><span class='ui-icon ui-icon-minusthick'></span>Line Chart </div>
        <div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>          
        </div>
    </div>
    <div class='column span4 ui-sortable' id='column3'></div> """

I want to convert the above HTML string into the corresponding HTML DOM elements in Python.

In jQuery I can do this with $(this).html(HTMLstr); but how do I parse it in Python?

Upvotes: 3

Views: 11804

Answers (2)

Ben S.

Reputation: 1143

Python has built-in libraries for parsing HTML documents. In Python 2.x, you have your choice of HTMLParser (recommended) and htmllib (deprecated); in Python 3.x, html.parser is the appropriate library (this is a renamed version of HTMLParser from Python 2.x).
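As a rough sketch of what using html.parser looks like on Python 3: you subclass HTMLParser and override its handler methods, which are called as each tag is encountered. The class name TagCollector and the seen list are made up for this illustration, and the input is a shortened version of the markup in the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records (tag, id) for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, dict(attrs).get("id")))


collector = TagCollector()
collector.feed(
    "<div class='column' id='column2'>"
    "<span class='ui-icon'></span>Line Chart </div>"
)
collector.close()
print(collector.seen)  # [('div', 'column2'), ('span', None)]
```

Nothing here builds a tree for you; any structure you want to keep has to be assembled inside the handlers.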

However, these are event-driven parsers (similar to XML SAX parsers), which may not be what you want. An alternative would be using one of Python's XML parsing tools, if you know that the document is going to be valid XML (i.e. tags properly closed, etc.). The libraries xml.dom and xml.dom.minidom are both options, depending on what kind of parsing you're looking for (I suspect xml.dom.minidom is sufficient for your purposes, given your example).

For example, you should be able to enter this in your Python console and get the output shown:

>>> import xml.dom.minidom
>>> x = xml.dom.minidom.parseString('<div class="column span4 ui-sortable" id="column2"><div class="portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all" id="widget_basicLine" /></div>')
>>> x.documentElement.nodeName
'div'
>>> x.documentElement.getAttribute("class")
'column span4 ui-sortable'
>>> len(x.documentElement.firstChild.childNodes)
0

A full description of the Node objects you receive is available in the xml.dom documentation. If you're used to using the DOM in JavaScript, you should find that most of the properties are the same. Note that because Python treats this as an XML document, HTML-specific properties like 'class' have no special significance, so I believe you have to use the getAttribute function to access them.
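One caveat when applying this to the string in the question: it contains several top-level div elements and one unquoted attribute (id=basicLine), and neither is well-formed XML. A minimal sketch of one workaround, assuming you can quote the attributes yourself and wrap the fragment in a single parent element (the wrapper name "root" is arbitrary, and the fragment below is a cut-down version of the original):

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Cut-down fragment: several sibling <div> elements, no common parent,
# with every attribute value quoted.
fragment = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
    <div class='portlet-content' id='basicLine'></div>
</div>
<div class='column span4 ui-sortable' id='column3'></div>
"""

# XML allows exactly one root element, so wrap the fragment in one.
doc = xml.dom.minidom.parseString("<root>" + fragment + "</root>")

# getElementsByTagName searches all descendants in document order.
ids = [div.getAttribute("id") for div in doc.getElementsByTagName("div")]
print(ids)

# An unquoted attribute value is fine in a browser but is rejected here.
try:
    xml.dom.minidom.parseString("<div id=basicLine></div>")
    parsed_unquoted = True
except ExpatError:
    parsed_unquoted = False
print(parsed_unquoted)
```

If you cannot guarantee well-formed input, a lenient HTML parser is likely a better fit than the XML tools.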

Upvotes: 6

Ofir Israel

Reputation: 3913

You should use BeautifulSoup; it does exactly what you need.

http://www.crummy.com/software/BeautifulSoup/

Upvotes: 2
