Reputation: 437

Converting html source content into readable format with Python 2.x

Python 2.7

I have a program that gets video titles from the source code of a webpage, but the titles are encoded with some kind of HTML escaping.

This is what I've tried so far:

>>> import urllib2
>>> urllib2.unquote('&pound;')
'&pound;'

So that didn't work... Then I tried:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'

As you can see, that doesn't work either, nor does any combination of the two.

I managed to find out that '&pound;' is an HTML character entity name, but I wasn't able to find out what '\xa3' is.

Does anyone know how to convert HTML content into a readable format in Python?

Upvotes: 1

Answers (4)

starenka

Reputation: 580

lxml, BeautifulSoup, or PyQuery do the job pretty well. Or a combination of these ;)

Upvotes: 0

Francis Avila

Reputation: 31641

&pound; is the HTML character entity for the POUND SIGN, which is Unicode character U+00A3. You can see this if you print the unescaped result:

>>> print u'\xa3'
£

When you called unescape(), you converted the character entity to its native Unicode character, which is what u'\xa3' means: a single U+00A3 Unicode character.

If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:

>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'

You get a two-byte string representing the single "POUND SIGN" character.
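
To make the point about encodings concrete, here is a small sketch (the variable names are just for illustration) showing that the same single character turns into different bytes depending on the encoding you pick. It is written so it should run on both Python 2.7 and 3.x:

```python
# One Unicode character: U+00A3 POUND SIGN.
pound = u'\xa3'

# UTF-8 needs two bytes for this character.
utf8_bytes = pound.encode('utf-8')      # b'\xc2\xa3'

# Latin-1 stores it in a single byte.
latin1_bytes = pound.encode('latin-1')  # b'\xa3'

# ASCII has no pound sign at all, so encoding fails.
try:
    pound.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The character itself never changes; only its byte representation does.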

I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
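
The three steps could look roughly like this. This is only a sketch: decode_title is a made-up helper name, the sample bytes stand in for whatever you read from the page, and I am assuming the page is UTF-8 (check the real charset). The import fallback is there so it runs on Python 3 as well as 2.7:

```python
try:
    from html import unescape              # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2.x, as in the question
    unescape = HTMLParser().unescape


def decode_title(raw_bytes, source_encoding='utf-8'):
    """Hypothetical helper: bytes -> unicode -> entities resolved."""
    # Step 1: bytes from the network become a unicode string.
    text = raw_bytes.decode(source_encoding)
    # Step 2: HTML entities such as &pound; become real characters.
    return unescape(text)


# Stand-in for the bytes returned by urllib2.urlopen(...).read().
raw = b'Tickets from &pound;5 &amp; up'
title = decode_title(raw)

# Step 3 (optional): encode for output, e.g. to write to a UTF-8 file.
encoded = title.encode('utf-8')
```

Keep everything as unicode inside your program and only encode at the very end, when you write or print.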

Upvotes: 1

Josh Rosen

Reputation: 13841

The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.

\xa3 is the Python escape sequence for the pound sign (£) inside a string literal. In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escape instead of the character. If you print this string, you can see it represents the pound sign:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
>>> print h.unescape('&pound;')
£

Upvotes: 1

dav1d

Reputation: 6055

Why doesn't that work?

In [1]: s = u'\xa3'

In [2]: s
Out[2]: u'\xa3'

In [3]: print s
£

When it comes to unescaping HTML entities, I have always used http://effbot.org/zone/re-sub.htm#unescape-html.
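
The general idea behind a re.sub-based unescape is something like the following. To be clear, this is my own rough sketch of the approach, not the code from that page, and unescape_entities is just a name I picked. The import fallback lets it run on Python 3 too:

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
    unichr = chr                                # Python 3 has no unichr
except ImportError:
    from htmlentitydefs import name2codepoint   # Python 2.x


def unescape_entities(text):
    """Replace named and numeric HTML entities in a unicode string."""
    def replace(match):
        entity = match.group(0)
        body = entity[1:-1]          # strip the leading '&' and trailing ';'
        try:
            if body[:2] in ('#x', '#X'):
                return unichr(int(body[2:], 16))   # hex, e.g. &#xa3;
            if body[:1] == '#':
                return unichr(int(body[1:]))       # decimal, e.g. &#163;
            return unichr(name2codepoint[body])    # named, e.g. &pound;
        except (KeyError, ValueError, OverflowError):
            return entity            # unknown or malformed: leave as-is
    return re.sub(r'&#?\w+;', replace, text)


result = unescape_entities(u'&pound;5 &#163; &#xa3; &bogus;')
```

Unknown entities such as &bogus; are left untouched rather than raising an error.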

Upvotes: 1
