Pigna

Reputation: 2924

How do I know which encoding is right?

I've decided to learn C++ and I really like the site www.learncpp.com. Now I would like to make a PDF version of it and print it, so that I can read it on paper. First I built a URL collector for all the chapters on the site. It works fine.

Now I'm working on creating an HTML file out of the first chapter. I wrote the following:

import requests
from bs4 import BeautifulSoup
import codecs

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.text,'lxml')

content = soup.find("div", class_="post-9")

f = open("first_lesson.html","w")
f.write(content.prettify().encode('utf-8'))
f.close()

and I got my first_lesson.html file in the folder. The problem is that when I open the HTML file to check the result, there are weird symbols (try running the code and see) here and there.

I added .encode('utf-8') because otherwise I would get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 155: ordinal not in range(128)
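For reference, the error is easy to reproduce in isolation (shown here with Python 3 literals; in Python 2 the same thing happens implicitly when a unicode string is written to a byte-oriented file):

```python
# '\u2014' is the EM DASH; the ascii codec only covers code points 0-127,
# so encoding it raises the same UnicodeEncodeError as in the question.
try:
    "\u2014".encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# utf-8 can represent it without error
utf8_bytes = "\u2014".encode("utf-8")
```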

How can I eliminate those weird symbols? What's the right encoding? And, in case I get similar problems in the future, how can I know what the right encoding is?

UPDATE: instead of encoding in 'utf-8' I encoded in 'windows-1252' and it worked. But what is the best strategy for figuring out the proper encoding? I don't think try-this-try-that is a good one.

Upvotes: 1

Views: 165

Answers (2)

timgeb

Reputation: 78750

content.prettify() is a unicode string. Among other characters, it contains the code point U+2014, which maps to — (EM DASH). The ASCII codec cannot encode it, because 8212 (0x2014) is larger than 127.

You can, however, encode your unicode string with any encoding that can handle that code point, for example utf-8, utf-16 or utf-32. There is no single "right" encoding, but utf-8 is by far the most widely used, so it is usually a good choice when you want to encode a unicode string. Still, you could have chosen another one (that Python supports) and your program would also work, for example with
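To make that concrete, here is a small sketch (Python 3 literals) encoding the same EM DASH with several codecs. Note that Python's plain utf-16 and utf-32 codecs prepend a byte-order mark, which is why the byte counts are larger than 2 and 4:

```python
s = "\u2014"  # EM DASH, code point 8212
assert ord(s) == 0x2014 == 8212

# the same abstract string, three different byte representations
utf8 = s.encode("utf-8")    # 3 bytes, no BOM
utf16 = s.encode("utf-16")  # 2-byte BOM + 2 bytes
utf32 = s.encode("utf-32")  # 4-byte BOM + 4 bytes

# all of them decode back to the identical unicode string
assert utf8.decode("utf-8") == utf16.decode("utf-16") == utf32.decode("utf-32") == s
```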

f.write(content.prettify().encode('utf-16'))

prettify gives you a unicode string and by default tries decoding with utf-8 (that's what I understand from having a look at the source), but you can give prettify an explicit encoding to work with as an argument. Think of unicode strings as an abstraction: a series of unicode code points, which basically corresponds to a series of characters.

If you ever need to find the content type of an HTML document with BeautifulSoup, there are existing questions on that topic worth searching for.

Another point: in general, whenever you have plain bytes and nobody tells you how they are supposed to be decoded, you are out of luck and have to play whack-a-mole. If you know that you are dealing with text, utf-8 is usually a good first guess, because (a) it is widely used and (b) the first 128 unicode code points correspond one-to-one with ASCII, and utf-8 encodes them with the same byte values.
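Point (b) can be checked directly: any pure-ASCII byte string decodes identically under both codecs, and the resulting string encodes back to the same bytes either way.

```python
data = b"Hello, world!"  # pure ASCII bytes

# ASCII and UTF-8 agree byte-for-byte on the first 128 code points,
# so both decodings produce the same string ...
assert data.decode("ascii") == data.decode("utf-8")

# ... and encoding the result gives back the original bytes either way
text = data.decode("utf-8")
assert text.encode("ascii") == text.encode("utf-8") == data
```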

You may also find a Unicode character table and the Unicode talk from PyCon 2012 useful.

Upvotes: 0

Padraic Cunningham

Reputation: 180481

Using requests with Python 2, you should pass .content to let BeautifulSoup take care of the encoding, and you can use io.open to write the unicode string to the file:

import requests
from bs4 import BeautifulSoup
import io


req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.content, 'lxml')
content = soup.find("div", class_="post-9")

with io.open("first_lesson.html", "w") as f:
    f.write(content.prettify())
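As a side note, io.open also accepts an explicit encoding argument on both Python 2 and 3. A minimal round-trip sketch, independent of requests/bs4, with a hypothetical temp file and utf-8 assumed:

```python
import io
import os
import tempfile

# hypothetical file path, just for the demonstration
path = os.path.join(tempfile.gettempdir(), "encoding_demo.html")

# unicode string in, utf-8 bytes on disk
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"<p>em dash: \u2014</p>")

# reading back with the same encoding recovers the EM DASH intact
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

os.remove(path)
```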

If you do want to specify an encoding, prettify takes an encoding argument, soup.prettify(encoding=...); there is also the response's encoding attribute:

enc = req.encoding

You can try parsing the Content-Type header with cgi.parse_header:

import cgi

enc = cgi.parse_header(req.headers.get('content-type', ""))[1]["charset"]
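The cgi module has since been deprecated (and removed in Python 3.13); email.message from the stdlib can parse the same parameter list. A sketch with a hard-coded header value standing in for req.headers['content-type']:

```python
from email.message import Message

# hard-coded header value for the demonstration; in the real script this
# would come from req.headers.get('content-type', "")
msg = Message()
msg["content-type"] = "text/html; charset=UTF-8"

# get_param looks at the Content-Type header by default and
# extracts the named parameter
enc = msg.get_param("charset")
```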

Or install and use the chardet module (chardet.detect returns a dict; the guessed codec is under the "encoding" key):

import chardet

enc = chardet.detect(req.content)["encoding"]

You should also be aware that many wrong encodings will run without error, but you will end up with garbage in the file. Here the charset is utf-8: you can see it in the response headers, and if you look at the page source you can see <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
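That "runs without error but produces garbage" failure mode looks like this: utf-8 bytes mistakenly decoded as windows-1252 turn the single EM DASH into three junk characters, with no exception raised.

```python
original = u"\u2014"             # EM DASH
data = original.encode("utf-8")  # b'\xe2\x80\x94' on the wire

# wrong codec: windows-1252 happily maps each byte to *some* character,
# so decoding succeeds silently -- you just get mojibake
garbled = data.decode("windows-1252")

assert garbled != original
assert garbled == u"\u00e2\u20ac\u201d"  # 'â', '€', right double quote
```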

Upvotes: 1
