Mismatched Tag Error While Parsing XML?

Question

I'm writing a script that downloads an HTML document from http://example.com/ and attempts to parse it as XML using:

import urllib.request
import xml.etree.ElementTree
with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)

However, I keep getting a ParseError: mismatched tag error, supposedly at line 1, column 2781. I downloaded the file manually (Ctrl+S in my browser) and checked it, but that position falls in the middle of a string, nowhere near the end of the file. There were a few lines before the 2781st character, though, so that may have thrown off my calculation of the exact position. I then tried downloading the response and writing it to a file to parse later:

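response = urllib.request.urlopen("http://example.com/")
# Save the raw response bytes so the file can be inspected later.
with open("test.html", "wb") as f:
    f.write(response.read())
# Parse the saved copy; the with block closes the file afterwards.
with open("test.html", "r") as html:
    tree = xml.etree.ElementTree.parse(html)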

I'm still getting the same mismatched tag error at the same column, but this time I opened the downloaded HTML file, and the only thing near column 2781 is this:

;



And the exact 2781st column marks the first "h" in , so what could be wrong here? Am I missing something?
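One way to see exactly what the parser is objecting to, instead of counting characters by hand, is to catch the ParseError and read its position attribute. The following is only a rough sketch: describe_parse_error and the sample string are hypothetical names and data, not part of the original script, and it assumes the document is already available as a str.

```python
import xml.etree.ElementTree as ET


def describe_parse_error(text, context=30):
    """Return (line, column, snippet) for the first XML parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # position is a (line, column) tuple; line is 1-based, column is 0-based.
        line, column = err.position
        lines = text.splitlines()
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ""
        start = max(column - context, 0)
        return line, column, bad_line[start:column + context]
    return None


# Hypothetical sample: <meta> is never closed, which is fine in HTML but not in XML.
sample = '<html><head><meta charset="utf-8"></head><body>hi</body></html>'
result = describe_parse_error(sample)
print(result)
```

Printing the snippet around the reported column shows the markup the parser was reading when it gave up, which is usually more reliable than locating the column in an editor.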

Edit:

I've been looking into it further and tried parsing the XML with another parser, this time minidom, but I'm still getting the exact same error at the exact same line. What could be the problem? This happens no matter how I download the file (urllib, curl, wget, even Ctrl+S in the browser); the result is always the same.

Edit 2:

This is what I've tried so far:

This is an example XML document I got from the API doc and saved to text.html:


    
        Example page
    
    
        Moved to example.org
        or example.com.
    



And I tried:

with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.parse(f)


And it works, then:

with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())


And it also works, but:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)


Doesn't. I also tried:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())


And that doesn't work either. What could be the problem? As far as I can tell, the document doesn't have mismatched tags. Is it perhaps too large? It's only 95.2 KB.
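To check whether document size or markup is the likelier culprit, a small experiment along these lines may help. It is a sketch under assumptions: parses_as_xml, big_doc and tiny_html are made-up names and test data, not taken from the real page.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if text is well-formed XML, False if expat rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Well-formed and much larger than 95.2 KB.
big_doc = "<root>" + "<item>x</item>" * 20000 + "</root>"

# Tiny, but <br> is never closed: acceptable HTML, not well-formed XML.
tiny_html = "<html><body>line one<br>line two</body></html>"

print(len(big_doc), parses_as_xml(big_doc))
print(len(tiny_html), parses_as_xml(tiny_html))
```

If the large document parses and the tiny one does not, that points at HTML-style unclosed tags rather than file size.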

Stephen Lin · Accepted Answer

You can use bs4 (Beautiful Soup) to parse this page, like this:

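import urllib.request

import bs4

url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# Proxy settings from my environment; drop or adjust them for your network.
proxies = {'http': 'http://www-proxy.ericsson.se:8080'}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open(url) as f:
    info = f.read()
# Name the parser explicitly; html.parser ships with Python and tolerates unclosed tags.
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)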

OUTPUT:

You can install bs4 from PyPI with pip install beautifulsoup4.
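If installing bs4 is not an option, the standard library's html.parser is also lenient about unclosed tags. The sketch below is an alternative to the bs4 approach, not a replacement for it: LinkCollector and the sample page string are hypothetical, and it only gathers href values rather than building a full tree.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical markup with an unclosed <meta> and <br>, as real HTML pages often have.
page = ('<html><head><meta charset="utf-8"></head>'
        '<body><a href="http://example.org/">moved</a><br></body></html>')

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The same input would raise a mismatched tag error in xml.etree.ElementTree, since the unclosed tags are legal HTML but not well-formed XML.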
