gsb

Reputation: 5640

Parsing html tags with Python

I have been given a URL and I want to extract the contents of the <BODY> tag from the page at that URL. I'm using Python 3. I came across sgmllib, but it is not available in Python 3.

Can someone please guide me with this? Can I use HTMLParser for this?

Here is what I tried:

import urllib.request
f=urllib.request.urlopen("URL")
s=f.read()

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered   some data:", data)

parser = MyHTMLParser()
parser.feed(s)

This gives me an error: TypeError: Can't convert 'bytes' object to str implicitly

Upvotes: 5

Views: 4670

Answers (2)

RanRag

Reputation: 49567

If you take a look at your s variable, its type is bytes.

>>> type(s)
<class 'bytes'>

and if you take a look at HTMLParser.feed, it requires a str as its argument. So decode the bytes first:

>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)

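or do x = str(s, 'utf-8'). Plain str(s) with no encoding would return the repr of the bytes object (b'...') rather than the decoded text.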
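As a rough sketch of how this could answer the original question (getting what is inside <BODY>), the parser below only keeps text seen between the body start and end tags. The class name BodyTextParser, the sample bytes in raw, and the UTF-8 encoding are all assumptions made for illustration; raw stands in for whatever f.read() returns.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Hypothetical helper: collects text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives here as "body".
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


# Stand-in for f.read(): urlopen(...).read() returns bytes like this.
raw = b"<html><head><title>T</title></head><BODY><p>Hello</p> <p>world</p></BODY></html>"

parser = BodyTextParser()
parser.feed(raw.decode("utf-8"))  # decode bytes to str before feeding (assumes UTF-8)
body_text = parser.chunks
print(body_text)
```

With the sample input this prints ['Hello', 'world']. The title text is skipped because it appears before the body tag opens.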

Upvotes: 4

pycoder112358

Reputation: 875

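To fix the TypeError, change line #3 to

s = str(f.read(), 'utf-8')

The web page you're getting is returned as bytes, and you need to decode the bytes into a string before feeding them to the parser. Note that str(f.read()) without an encoding avoids the TypeError but yields the repr of the bytes object (b'...'), not the page text. The 'utf-8' here assumes the page is UTF-8 encoded.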
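To illustrate why the encoding matters when turning bytes into a string, here is a small sketch comparing the two conversions. The sample bytes in raw are made up and stand in for f.read(); UTF-8 is an assumption about the page.

```python
# Stand-in for f.read(); a real page would come from urllib.request.urlopen(...).read().
raw = "<body>caf\u00e9</body>".encode("utf-8")

as_repr = str(raw)             # repr of the bytes object, nothing is decoded
as_text = raw.decode("utf-8")  # actual decoding to text
also_text = str(raw, "utf-8")  # same result as raw.decode("utf-8")

print(as_repr)
print(as_text)

# With a real response object f, f.headers.get_content_charset() may report the
# charset declared by the server; falling back to "utf-8" when it returns None
# is an assumption, not a guarantee.
```

The first print shows a string that starts with b' and contains escaped bytes, which is not what the parser should see. The second shows the page text itself.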

Upvotes: 10
