Reputation: 9115
I'm writing a script that downloads an HTML document from http://example.com/ and tries to parse it as XML using:
with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)
However, I keep getting a ParseError: mismatched tag error, supposedly at line 1, column 2781. I downloaded the file manually (Ctrl+S in my browser) and checked it, but that position falls in the middle of a string, nowhere near the end of the file. There were a few lines before the 2781st character, though, so that may have thrown off my calculation of the exact position. I then tried downloading the response and writing it to a file to parse later:
response = urllib.request.urlopen("http://example.com/")
f = open("test.html", "wb")
f.write(response.read())
f.close()
html = open("test.html", "r")
tree = xml.etree.ElementTree.parse(html)
And I'm still getting the same mismatched tag error at the same column, but this time I opened the downloaded HTML and the only thing near column 2781 is this:
;</script></head><body class
Column 2781 marks the first "h" in </head>, so what could be wrong here? Am I missing something?
Edit:
I've looked into it further and tried a different XML parser, this time minidom, but I get exactly the same error at exactly the same position. What could be the problem? It happens no matter how I download the file (urllib, curl, wget, even saving from the browser); the result is always the same.
Edit 2:
This is what I've tried so far:
This is an example XML document I took from the API docs and saved as text.html:
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>
or <a href="http://example.com/">example.com</a>.</p>
</body>
</html>
And I tried:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.parse(f)
And it works. Then:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())
And it also works, but:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)
doesn't. I also tried:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())
And that doesn't work either. What could be the problem? As far as I can tell the document doesn't have mismatched tags. Could it be too large? It's only 95.2 KB.
Upvotes: 1
Views: 16987
Reputation: 4912
You can use bs4 to parse this page, like this:
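import bs4
import urllib.request
url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# This proxy is specific to my network; drop the ProxyHandler if you connect directly.
proxies = {'http': 'http://www-proxy.ericsson.se:8080'}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open(url) as f:
    info = f.read()
# Name the parser explicitly so bs4 does not have to guess one.
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)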
OUTPUT:
<a href="/a/" title="Anime & Manga">a</a>
You can download bs4 from this link.
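If installing bs4 is not an option, the standard library's html.parser module is also tolerant of real-world HTML. Here is a minimal sketch of that alternative. The LinkCollector class and the inline sample string are made up for illustration (they stand in for the downloaded page), and this is a different technique from bs4, not a drop-in replacement:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, title) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already decoded
        if tag == "a":
            attributes = dict(attrs)
            self.links.append((attributes.get("href"), attributes.get("title")))


# Made-up sample with an unclosed <meta>, <br> and <p>, which an XML parser would reject
sample = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<a href="/a/" title="Anime &amp; Manga">a</a><br><p>unclosed paragraph'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The parser does not complain about the unclosed tags; it just reports each start tag as it sees it, which is usually enough for pulling out links or attributes.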
Upvotes: 2
Reputation: 7889
Based on the urllib and ElementTree documentation, this code snippet seemed to work without error for your sample URL.
import urllib.request
import xml.etree.ElementTree as ET
with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    html = response.read()
# ET.parse() expects a filename or file object; fromstring() takes data already in memory
root = ET.fromstring(html)
If you don't want to read the response into a variable before parsing it with ElementTree, you can pass the response object itself, since ET.parse accepts a file object:
with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    tree = ET.parse(response)
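Keep in mind that ElementTree only accepts well-formed XML, and ordinary HTML often is not: tags such as <meta>, <link> or <br> are usually left unclosed. That is one likely explanation for a mismatched tag error that points at </head>, although I can't confirm it for your exact page. Here is a minimal sketch of how to check. The describe_parse_error helper and both sample documents are made up for illustration:

```python
import xml.etree.ElementTree as ET


def describe_parse_error(text, context=20):
    """Hypothetical helper: return (message, nearby text) if text is not
    well-formed XML, or None if it parses cleanly."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        bad_line = text.splitlines()[line - 1]
        start = max(0, column - context)
        return str(err), bad_line[start:column + context]
    return None


# Typical HTML: <meta> is never closed, so </head> does not match the open tag
html_like = '<html><head><meta charset="utf-8"></head><body></body></html>'
# The same document with a self-closing <meta/>, which is well-formed XML
xhtml_like = '<html><head><meta charset="utf-8"/></head><body></body></html>'

print(describe_parse_error(html_like))
print(describe_parse_error(xhtml_like))
```

Running the helper on your downloaded file should show what sits just before the reported column. If it is an unclosed tag like the one above, an HTML parser is the right tool rather than an XML one.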
Upvotes: 0