Reputation: 17382
I have a string of HTML elements:
HTMLstr = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
<div class='portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all' id='widget_basicLine'>
<div class='portlet-header ui-widget-header ui-corner-all'><span class='ui-icon ui-icon-minusthick'></span>Line Chart </div>
<div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>
</div>
</div>
<div class='column span4 ui-sortable' id='column3'></div> """
I want to convert the above HTML string into the corresponding HTML DOM elements in Python.
I can do it in a jQuery/AJAX function via $(this).html(HTMLstr);
but how do I parse it in Python?
Upvotes: 3
Views: 11804
Reputation: 1143
Python has built-in libraries for parsing HTML documents. In Python 2.x, you have your choice of HTMLParser (recommended) and htmllib (deprecated); in Python 3.x, html.parser is the appropriate library (this is a renamed version of HTMLParser from Python 2.x).
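As a rough illustration of the event-driven style, here is a minimal Python 3 sketch using html.parser. The class name DivCollector and the shortened html_str are made up for this example; the parser simply records the id and class of each div start tag as it is encountered, rather than building a tree.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Hypothetical helper: record the id and class of every <div> start tag."""

    def __init__(self):
        super().__init__()
        self.divs = []  # (id, class) tuples in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "div":
            attr_map = dict(attrs)
            self.divs.append((attr_map.get("id"), attr_map.get("class")))


# Shortened version of the string from the question (an assumption for brevity)
html_str = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
<div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>
</div>
"""

collector = DivCollector()
collector.feed(html_str)
collector.close()

div_ids = [div_id for div_id, _ in collector.divs]
print(div_ids)  # ['column1', 'column2', 'basicLine']
```

Note that this parser tolerates the unquoted id=basicLine attribute, but you only get callbacks, not a DOM-like tree to navigate.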
However, these are event-driven parsers (similar to XML SAX parsers), which may not be what you want. An alternative would be using one of Python's XML parsing tools, if you know that the document is going to be well-formed XML (i.e. tags properly closed, attribute values quoted, a single root element, etc.). The libraries xml.dom and xml.dom.minidom are both options, depending on what kind of parsing you're looking for (I suspect xml.dom.minidom is sufficient for your purposes, given your example).
For example, you should be able to enter this in your Python console and get the output shown:
>>> import xml.dom.minidom
>>> x = xml.dom.minidom.parseString('<div class="column span4 ui-sortable" id="column2"><div class="portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all" id="widget_basicLine" /></div>')
>>> x.documentElement.nodeName
'div'
>>> x.documentElement.getAttribute("class")
'column span4 ui-sortable'
>>> len(x.documentElement.firstChild.childNodes)
0
A full description of the Node objects you receive is available in the xml.dom documentation. If you're used to using the DOM in JavaScript, you should find that most of the properties are the same. Note that because Python treats this as an XML document, HTML-specific properties like 'class' have no special significance, so I believe you have to use the getAttribute function to access them.
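To apply this to the full string from the question, two adjustments seem to be needed, since minidom only accepts well-formed XML: the fragment has several top-level elements, so it must be wrapped in a single root, and the unquoted id=basicLine has to be quoted. The sketch below assumes both fixes; the root element name and the variable names are arbitrary choices for illustration.

```python
import xml.dom.minidom

# The question's markup, with id='basicLine' quoted so that it is well-formed XML
html_str = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
<div class='portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all' id='widget_basicLine'>
<div class='portlet-header ui-widget-header ui-corner-all'><span class='ui-icon ui-icon-minusthick'></span>Line Chart </div>
<div class='portlet-content' id='basicLine' style='height:270px; margin: 0 auto;'></div>
</div>
</div>
<div class='column span4 ui-sortable' id='column3'></div>
"""

# XML allows only one top-level element, so wrap the fragment in an
# arbitrary <root> element before parsing.
doc = xml.dom.minidom.parseString("<root>" + html_str + "</root>")

# getElementsByTagName searches the whole subtree in document order
divs = doc.getElementsByTagName("div")

# ids of every div that actually has an id attribute
ids = [d.getAttribute("id") for d in divs if d.hasAttribute("id")]

# 'class' is an ordinary attribute here, so split it manually to test membership
columns = [d.getAttribute("id") for d in divs
           if "column" in d.getAttribute("class").split()]

print(ids)      # ['column1', 'column2', 'widget_basicLine', 'basicLine', 'column3']
print(columns)  # ['column1', 'column2', 'column3']
```

If the markup cannot be guaranteed to be well-formed, parseString raises an ExpatError, in which case a forgiving HTML parser is the safer choice.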
Upvotes: 6
Reputation: 3913
You should use BeautifulSoup; it does exactly what you need.
http://www.crummy.com/software/BeautifulSoup/
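A minimal sketch of what that might look like, assuming the third-party beautifulsoup4 package is installed (pip install beautifulsoup4) and using Python's built-in html.parser as the backend. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

html_str = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
<div class='portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all' id='widget_basicLine'>
<div class='portlet-header ui-widget-header ui-corner-all'><span class='ui-icon ui-icon-minusthick'></span>Line Chart </div>
<div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>
</div>
</div>
<div class='column span4 ui-sortable' id='column3'></div> """

# Build a navigable tree; no wrapping root or attribute quoting is required
soup = BeautifulSoup(html_str, "html.parser")

# class_ matches any one of the space-separated class names
column_ids = [div["id"] for div in soup.find_all("div", class_="column")]

# Look up a single element by id and read its attributes like a dict
content = soup.find(id="basicLine")
content_style = content["style"]

# Text of the header div, with surrounding whitespace stripped
header_text = soup.find("div", class_="portlet-header").get_text(strip=True)

print(column_ids)     # ['column1', 'column2', 'column3']
print(content_style)  # height:270px; margin: 0 auto;
print(header_text)    # Line Chart
```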
Upvotes: 2