Pigna

Reputation: 2924

How do I know which encoding is right?

I've decided to learn C++ and I really like the site www.learncpp.com. Now I would like to make a PDF version of it and print it, so that I can read it on paper. First I built a URL collector for all the chapters on the site. It works fine.

Now I'm working on creating an HTML file out of the first chapter. I wrote the following:

import requests
from bs4 import BeautifulSoup
import codecs

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.text,'lxml')

content = soup.find("div", class_="post-9")

f = open("first_lesson.html","w")
f.write(content.prettify().encode('utf-8'))
f.close()

and I got my first_lesson.html file in the folder. The problem is that when I open the HTML file to check the result, there are weird symbols (try running the code and see) here and there.

I added .encode('utf-8') because otherwise I would get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 155: ordinal not in range(128)
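For reference, the error is easy to reproduce in isolation (shown here with Python 3 literals; in Python 2 the same thing happens implicitly when a unicode string is written to a byte-oriented file):

```python
# '\u2014' is the EM DASH; the ascii codec only covers code points 0-127,
# so encoding it raises the same UnicodeEncodeError as in the question.
try:
    "\u2014".encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# utf-8 can represent it without error
utf8_bytes = "\u2014".encode("utf-8")
```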

How can I eliminate those weird symbols? What's the right encoding? And, in case I get similar problems in the future, how can I know what the right encoding is?

UPDATE: instead of encoding in 'utf-8' I encoded in 'windows-1252' and it worked. But what is the best strategy for figuring out the proper encoding? I don't think try-this-try-that is a good one.

Upvotes: 1

Views: 165

Answers (2)

timgeb

Reputation: 78750

content.prettify() is a unicode string. Among other characters, it contains the code point U+2014, which maps to — (EM DASH). The ASCII codec cannot encode it, because 8212 (0x2014) is larger than 127.

You can, however, encode your unicode string with any encoding that can handle that code point, for example utf-8, utf-16 or utf-32. There is no single "right" encoding, but utf-8 is by far the most widely used, so it is usually a good choice when you want to encode a unicode string. Still, you could have chosen another one (that Python supports) and your program would also work, for example with
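To make that concrete, here is a small sketch (Python 3 literals) encoding the same EM DASH with several codecs. Note that Python's plain utf-16 and utf-32 codecs prepend a byte-order mark, which is why the byte counts are larger than 2 and 4:

```python
s = "\u2014"  # EM DASH, code point 8212
assert ord(s) == 0x2014 == 8212

# the same abstract string, three different byte representations
utf8 = s.encode("utf-8")    # 3 bytes, no BOM
utf16 = s.encode("utf-16")  # 2-byte BOM + 2 bytes
utf32 = s.encode("utf-32")  # 4-byte BOM + 4 bytes

# all of them decode back to the identical unicode string
assert utf8.decode("utf-8") == utf16.decode("utf-16") == utf32.decode("utf-32") == s
```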

f.write(content.prettify().encode('utf-16'))

prettify gives you a unicode string and by default tries decoding with utf-8 (that's what I understand from having a look at the source), but you can give prettify an explicit encoding to work with as an argument. Think of unicode strings as an abstraction: a series of unicode code points, which basically corresponds to a series of characters.

If you ever need to find the content type of an HTML document with BeautifulSoup, there are existing questions on that topic worth searching for.

Another point: in general, whenever you have plain bytes and nobody tells you how they are supposed to be decoded, you are out of luck and have to play whack-a-mole. If you know that you are dealing with text, utf-8 is usually a good first guess, because (a) it is widely used and (b) the first 128 unicode code points correspond one-to-one with ASCII, and utf-8 encodes them with the same byte values.
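Point (b) can be checked directly: any pure-ASCII byte string decodes identically under both codecs, and the resulting string encodes back to the same bytes either way.

```python
data = b"Hello, world!"  # pure ASCII bytes

# ASCII and UTF-8 agree byte-for-byte on the first 128 code points,
# so both decodings produce the same string ...
assert data.decode("ascii") == data.decode("utf-8")

# ... and encoding the result gives back the original bytes either way
text = data.decode("utf-8")
assert text.encode("ascii") == text.encode("utf-8") == data
```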

You may also find a Unicode character table and the Unicode talk from PyCon 2012 useful.

Upvotes: 0

Padraic Cunningham

Reputation: 180481

Using requests with Python 2, you should pass .content to let BeautifulSoup take care of the encoding, and you can use io.open to write the unicode string to the file:

import requests
from bs4 import BeautifulSoup
import io


req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.content, 'lxml')
content = soup.find("div", class_="post-9")

with io.open("first_lesson.html", "w") as f:
    f.write(content.prettify())
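As a side note, io.open also accepts an explicit encoding argument on both Python 2 and 3. A minimal round-trip sketch, independent of requests/bs4, with a hypothetical temp file and utf-8 assumed:

```python
import io
import os
import tempfile

# hypothetical file path, just for the demonstration
path = os.path.join(tempfile.gettempdir(), "encoding_demo.html")

# unicode string in, utf-8 bytes on disk
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"<p>em dash: \u2014</p>")

# reading back with the same encoding recovers the EM DASH intact
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

os.remove(path)
```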

If you do want to specify an encoding, prettify takes an encoding argument, soup.prettify(encoding=...); there is also the response's encoding attribute:

enc = req.encoding

You can try parsing the Content-Type header with cgi.parse_header:

import cgi

enc = cgi.parse_header(req.headers.get('content-type', ""))[1]["charset"]
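The cgi module has since been deprecated (and removed in Python 3.13); email.message from the stdlib can parse the same parameter list. A sketch with a hard-coded header value standing in for req.headers['content-type']:

```python
from email.message import Message

# hard-coded header value for the demonstration; in the real script this
# would come from req.headers.get('content-type', "")
msg = Message()
msg["content-type"] = "text/html; charset=UTF-8"

# get_param looks at the Content-Type header by default and
# extracts the named parameter
enc = msg.get_param("charset")
```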

Or install and use the chardet module (chardet.detect returns a dict; the guessed codec is under the "encoding" key):

import chardet

enc = chardet.detect(req.content)["encoding"]

You should also be aware that many wrong encodings will run without error, but you will end up with garbage in the file. Here the charset is utf-8: you can see it in the response headers, and if you look at the page source you can see <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
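That "runs without error but produces garbage" failure mode looks like this: utf-8 bytes mistakenly decoded as windows-1252 turn the single EM DASH into three junk characters, with no exception raised.

```python
original = u"\u2014"             # EM DASH
data = original.encode("utf-8")  # b'\xe2\x80\x94' on the wire

# wrong codec: windows-1252 happily maps each byte to *some* character,
# so decoding succeeds silently -- you just get mojibake
garbled = data.decode("windows-1252")

assert garbled != original
assert garbled == u"\u00e2\u20ac\u201d"  # 'â', '€', right double quote
```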

Upvotes: 1
