Michael VEE

Reputation: 117

Python parsing html mismatched tag error

30   <li class="start_1">
31               <input type="checkbox" name="word_ids[]" value="34" class="list_check">
32          </li> 

This is part of an HTML file that I want to parse. But when I ran

import xml.etree.ElementTree as ET

# Read the whole file, then hand it to the XML parser
with open('1.htm', 'r') as uh:
    data = uh.read()
print(data)
tree = ET.fromstring(data)

It showed

xml.etree.ElementTree.ParseError: mismatched tag: line 32, column 18

I don't know what is going wrong.

Upvotes: 0

Views: 3200

Answers (2)

Mirko Conti

Reputation: 618

To parse HTML in Python, I use lxml:

import lxml.html
# html string
dom = '<li class="start_1">...</li>'
# get the root node
root_node = lxml.html.fromstring(dom)

After that you can work with it, for example using XPath:

nodes = root_node.xpath("//*")

Upvotes: 1

Martijn Pieters

Reputation: 1121168

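You are trying to parse HTML with an XML parser; the latter has no concept of an <input> element without a closing tag. In XML every element must be closed, so the parser reports a mismatched tag when it reaches </li> while <input> is still open.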
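
The failure can be reproduced without the file. The following is only a minimal sketch: the fragment is adapted from the question, and parse_error is a hypothetical helper introduced for illustration. It shows that the same markup parses once <input> is written as a self-closing tag, which is what an XML parser expects.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the question: <input> is never closed.
html_fragment = (
    '<li class="start_1">\n'
    '    <input type="checkbox" name="word_ids[]" value="34" class="list_check">\n'
    '</li>'
)

# Same fragment with <input ... /> self-closed, which is well-formed XML.
xml_fragment = html_fragment.replace('class="list_check">', 'class="list_check"/>')


def parse_error(markup):
    """Hypothetical helper: return the ParseError message, or None if markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The first call reports a mismatched tag; the second reports no error.
print(parse_error(html_fragment))
print(parse_error(xml_fragment))
```

Rewriting the markup by hand like this is rarely practical for real HTML pages, which is why switching to an HTML parser is the better fix.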

Use an actual HTML parser; if you want to access the result with an ElementTree-compatible API, use the lxml project, which includes an HTML parser. Otherwise, use BeautifulSoup (which can use lxml under the hood as the parsing engine).
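
If installing lxml or BeautifulSoup is not an option, the standard library's html.parser module is also an HTML parser and accepts the unclosed <input>. This is a swapped-in alternative to the two libraries named above, shown as a rough sketch only; the CheckboxCollector class name and the fragment are assumptions made for illustration.

```python
from html.parser import HTMLParser


class CheckboxCollector(HTMLParser):
    """Hypothetical example class: collect the value of every <input type="checkbox">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "checkbox":
            self.values.append(attributes.get("value"))


# Fragment adapted from the question; <input> has no closing tag.
html_fragment = '''<li class="start_1">
    <input type="checkbox" name="word_ids[]" value="34" class="list_check">
</li>'''

collector = CheckboxCollector()
collector.feed(html_fragment)
collector.close()
print(collector.values)  # ['34']
```

Unlike lxml, html.parser is event-based and does not build an ElementTree-style tree, so it suits simple extraction more than tree navigation or XPath queries.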

Upvotes: 1
