Reputation: 5640
I have been given a URL and I want to extract the contents of the <BODY>
tag from the page at that URL.
I'm using Python 3. I came across sgmllib,
but it is not available for Python 3.
Can someone please guide me with this? Can I use HTMLParser
for this?
Here is what I tried:
import urllib.request
f = urllib.request.urlopen("URL")
s = f.read()
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)
parser = MyHTMLParser()
parser.feed(s)
This gives me the error: TypeError: Can't convert 'bytes' object to str implicitly
Upvotes: 5
Views: 4670
Reputation: 49567
If you take a look at your s
variable, its type is bytes:
>>> type(s)
<class 'bytes'>
HTMLParser.feed requires a str as its argument, so decode the bytes first:
>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)
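or do x = str(s, 'utf-8'). Note that plain str(s), without an encoding, does not decode the bytes; it returns their repr, a string that starts with b' and contains escape sequences.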
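To see why the encoding argument matters, here is a minimal sketch comparing the conversions. The raw value is a made-up sample standing in for the result of f.read(), and UTF-8 is an assumption about the page's encoding.

```python
# Hypothetical sample bytes standing in for f.read(); assumed to be UTF-8.
raw = b"<body>caf\xc3\xa9</body>"

decoded = raw.decode("utf-8")     # real text, suitable for parser.feed()
also_decoded = str(raw, "utf-8")  # equivalent to raw.decode("utf-8")
as_repr = str(raw)                # NOT decoded: the repr of the bytes object

print(decoded)
print(as_repr)
```

Only the first two produce text the parser can handle correctly; the third avoids the TypeError but feeds the parser a string that begins with b' and carries escaped byte values.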
Upvotes: 4
Reputation: 875
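To fix the TypeError, change line #3 to
s = f.read().decode('utf-8')
The web page you're getting is returned as bytes, and you need to decode the bytes into a string before feeding them to the parser. Use the page's actual encoding if it is not UTF-8. Calling str(f.read()) without an encoding would also silence the error, but it yields the repr of the bytes rather than the decoded text.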
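Putting the pieces together, one possible way to collect only the text inside the body tag with HTMLParser is sketched below. The class name BodyTextParser, the helper body_text, and the sample bytes are all hypothetical names introduced for illustration, and the UTF-8 default is an assumption.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collects non-empty text chunks found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives as "body".
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and text:
            self.chunks.append(text)


def body_text(raw, encoding="utf-8"):
    """Decode raw bytes (encoding is an assumption) and return body text chunks."""
    parser = BodyTextParser()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.chunks


# Hypothetical sample standing in for urllib.request.urlopen(...).read()
sample = b"<html><head><title>T</title></head><BODY><p>Hello</p><p>World</p></BODY></html>"
print(body_text(sample))
```

With a real response object f from urllib.request.urlopen, f.headers.get_content_charset() returns the charset declared by the server, or None if there is none, and could be passed as the encoding instead of assuming UTF-8.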
Upvotes: 10