Reputation: 2924
I've decided to learn C++ and I really like the site www.learncpp.com. Now I would like to make a PDF version of it and print it, so that I can read it on paper. First I built a URL collector for all the chapters on the site, and it works fine.
Now I'm working on creating an HTML file from the first chapter. I wrote the following:
import requests
from bs4 import BeautifulSoup
import codecs
req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.text,'lxml')
content = soup.find("div", class_="post-9")
f = open("first_lesson.html","w")
f.write(content.prettify().encode('utf-8'))
f.close()
and I got my first_lesson.html file in the folder.
The problem is that when I open the HTML file to check the result, there are weird symbols here and there (try running the code and see).
I added .encode('utf-8') because otherwise I would get the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 155: ordinal not in range(128)
How do I eliminate those weird symbols? What's the right encoding? And in case I run into similar problems in the future, how can I tell what the right encoding is?
UPDATE: instead of encoding in 'utf-8' I encoded in 'windows-1252' and it worked. But what is the best strategy for figuring out the proper encoding? Try-this-try-that doesn't seem like a good one.
Upvotes: 1
Views: 165
Reputation: 78750
content.prettify() is a unicode string. Among other things, it contains the code point U+2014, which maps to the character — (EM DASH). The ASCII codec cannot encode it, because 8212 = 0x2014 is larger than 127.
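To see the failure in isolation, a minimal Python 2 sketch (the string literal is just an example):
u"\u2014".encode('ascii')   # raises UnicodeEncodeError: ordinal not in range(128)
u"\u2014".encode('utf-8')   # works, gives the bytes '\xe2\x80\x94'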
You can, however, encode your unicode string with any encoding that can handle unicode code points, for example utf-8, utf-16 or utf-32. There is no single "right" encoding, but utf-8 is the most widely used of them, so it is usually a good choice when you want to encode a unicode string. Still, you could have chosen another one that Python supports and your program would, for example, also work with
f.write(content.prettify().encode('utf-16'))
prettify gives you a unicode string and by default tries the decoding with utf-8 (that's what I understand from having a look at the source), but you can give prettify an explicit encoding to work with as an argument. Think of unicode strings as an abstraction: a series of unicode code points, which basically corresponds to a series of characters (which are nothing but small images).
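A quick way to see the difference, as a sketch assuming BeautifulSoup 4 under Python 2:
from bs4 import BeautifulSoup
soup = BeautifulSoup(u"<p>\u2014</p>", "lxml")
print type(soup.prettify())          # <type 'unicode'>
print type(soup.prettify("utf-8"))   # <type 'str'>, i.e. already-encoded bytes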
If you ever need to find the content type of an HTML document with BeautifulSoup, you may find this and this question useful.
Another point: in general, whenever you have plain bytes and nobody tells you how they are supposed to be decoded, you are out of luck and have to play whack-a-mole. If you know that you are dealing with text, utf-8 is usually a good first guess, because a) it is widely used and b) the first 128 unicode code points correspond one-to-one with ASCII, and utf-8 encodes them with the same byte values. A sketch of that guessing strategy follows below.
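If you do have to guess, a common pattern is to try likely candidates in order and keep the first that decodes cleanly. This is only a sketch; the candidate list is an assumption, and latin-1 never fails, so it acts as a catch-all last resort:
def guess_decode(data):
    # try likely encodings in order; return the text and the codec that worked
    for enc in ('utf-8', 'windows-1252', 'latin-1'):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue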
You may also find this character table and this talk from PyCon 2012 useful.
Upvotes: 0
Reputation: 180481
Using requests in Python 2, you should pass req.content so that BeautifulSoup takes care of the encoding itself, and you can use io.open to write the unicode result to the file:
import requests
from bs4 import BeautifulSoup
import io

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
# pass the raw bytes and let BeautifulSoup work out the encoding
soup = BeautifulSoup(req.content, 'lxml')
content = soup.find("div", class_="post-9")

# prettify() returns unicode; io.open accepts it in text mode
with io.open("first_lesson.html", "w", encoding="utf-8") as f:
    f.write(content.prettify())
If you did want to specify the encoding, prettify takes an encoding argument, soup.prettify(encoding=...). The response also has an encoding attribute:
enc = req.encoding
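Note that req.encoding is taken from the Content-Type header, so it can be missing or wrong; requests also exposes apparent_encoding, a chardet-based guess from the raw bytes, which you can compare against it:
print req.encoding            # from the header, 'UTF-8' for this page
print req.apparent_encoding   # guessed from the body bytes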
You can also try parsing the Content-Type header with cgi.parse_header:
import cgi
enc = cgi.parse_header(req.headers.get('content-type', ""))[1]["charset"]
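For illustration, parse_header splits a header value into the main value and a dict of parameters:
import cgi
cgi.parse_header("text/html; charset=UTF-8")
# -> ('text/html', {'charset': 'UTF-8'})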
Or try installing and using the chardet module:
import chardet
enc = chardet.detect(req.content)["encoding"]  # detect() returns a dict
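detect returns a dict rather than a bare codec name, so you pick out the "encoding" key; the confidence value below is illustrative:
print chardet.detect(req.content)
# e.g. {'encoding': 'utf-8', 'confidence': 0.99}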
You should also be aware that many encodings may run without error, but you will end up with garbage in the file. Here the charset is utf-8: you can see it in the headers returned, and if you look at the page source you can see <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
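If you would rather read that charset from the markup than from the headers, a small sketch (the attribute lookup assumes the page uses the http-equiv style meta tag shown above):
import cgi
meta = soup.find("meta", attrs={"http-equiv": "Content-Type"})
if meta is not None:
    enc = cgi.parse_header(meta.get("content", ""))[1].get("charset")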
Upvotes: 1