Reputation: 117
30 <li class="start_1">
31 <input type="checkbox" name="word_ids[]" value="34" class="list_check">
32 </li>
This is part of an HTML file that I want to parse. But when I ran
import xml.etree.ElementTree as ET

uh = open('1.htm', 'r')
data = uh.read()
print(data)
tree = ET.fromstring(data)  # this is the call that raises the error
it raised:
xml.etree.ElementTree.ParseError: mismatched tag: line 32, column 18
What is going wrong?
Upvotes: 0
Views: 3200
Reputation: 618
To parse HTML in Python, I use lxml:
import lxml.html
# HTML string
dom = '<li class="start_1">...</li>'
# get the root node
root_node = lxml.html.fromstring(dom)
After that you can query the tree, for example using XPath:
nodes = root_node.xpath("//*")
Upvotes: 1
Reputation: 1121168
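You are trying to parse HTML with an XML parser. XML requires every opening tag to be closed, so the parser has no concept of an <input>
without a closing tag; it reaches </li> while <input> is still open and reports a mismatched tag.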
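To see the difference for yourself, here is a minimal sketch using only the standard library. The names html_fragment, xml_fragment and parses_as_xml are made up for this illustration; the fragment is the one from the question without the line numbers.

```python
import xml.etree.ElementTree as ET

# The fragment from the question: <input> is never closed,
# which is fine in HTML but not well-formed XML.
html_fragment = (
    '<li class="start_1">\n'
    '<input type="checkbox" name="word_ids[]" value="34" class="list_check">\n'
    '</li>'
)

def parses_as_xml(text):
    """Return True if ElementTree accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The same fragment with <input> written as a self-closing element.
xml_fragment = html_fragment.replace('class="list_check">', 'class="list_check"/>')

print(parses_as_xml(html_fragment))  # the unclosed <input> is rejected
print(parses_as_xml(xml_fragment))   # the self-closed version is accepted
```

Rewriting the markup like this only helps if you control the file; for HTML you did not write, an HTML parser is the more robust choice.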
Use an actual HTML parser; if you want to access the result with an ElementTree-compatible API, use the lxml
project, which includes an HTML parser. Otherwise, use BeautifulSoup (which can use lxml
under the hood as the parsing engine).
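A rough sketch of the BeautifulSoup route, assuming the bs4 package is installed. It uses the built-in "html.parser" backend so that lxml is not required; pass "lxml" instead if you have it. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The fragment from the question, without the line numbers.
html_fragment = (
    '<li class="start_1">\n'
    '<input type="checkbox" name="word_ids[]" value="34" class="list_check">\n'
    '</li>'
)

# An HTML parser knows that <input> is a void element and needs no closing tag.
soup = BeautifulSoup(html_fragment, "html.parser")

# Look up the checkbox by tag name and CSS class, then read its attributes.
checkbox = soup.find("input", class_="list_check")
print(checkbox["name"])
print(checkbox["value"])
```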
Upvotes: 1