mel

Reputation: 2790

Python crawler: downloading HTML page

I want to crawl a website (gently) and download each HTML page that I crawl. To accomplish that I use the requests library. I have already built my crawl listing, and I first tried to fetch the pages with urllib.urlopen, but without a user-agent header I get an error. So I switched to requests, but I don't really know how to use it.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
}
page = requests.get('http://www.xf.com/ranking/get/?Amount=1&From=left&To=right', headers=headers)
with open('pages/test.html', 'w') as outfile:
    outfile.write(page.text)

The problem is that when the script tries to write the response to my file, I get an encoding error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6673-6675: ordinal not in range(128)

How can I write the response to a file without running into these encoding problems?

Upvotes: 0

Views: 672

Answers (2)

Martijn Pieters

Reputation: 1124488

In Python 2, files opened in text mode don't accept Unicode strings. Use response.content to access the original, undecoded binary content:

with open('pages/test.html', 'wb') as outfile:
    outfile.write(page.content)

This will write the downloaded HTML in the original encoding as served by the website.

Alternatively, if you want to re-encode all responses to a specific encoding, use io.open() to produce a file object that does accept Unicode:

import io

with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
    outfile.write(page.text)

Note that many websites rely on signalling the correct codec in HTML <meta> tags, and the content can be served without a charset parameter in the Content-Type header altogether.

In that case requests uses the default codec for the text/* mimetype, Latin-1, to decode the HTML to Unicode text. This is often the wrong codec, and relying on this behaviour can lead to Mojibake output later on. I recommend you stick to writing the binary content and rely on a tool like BeautifulSoup to detect the correct encoding later on.
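BeautifulSoup does this detection for you, but the idea can be illustrated with a simplified sniffer built on the standard-library html.parser module (Python 3; the class name CharsetSniffer is made up for this sketch, and it only handles the two common meta-tag forms):

```python
from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Record the first charset declared in a <meta> tag (simplified)."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset is not None:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:
            # HTML5 style: <meta charset="utf-8">
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            # HTML4 style: <meta http-equiv="Content-Type"
            #                    content="text/html; charset=utf-8">
            content = attrs.get('content', '')
            if 'charset=' in content:
                self.charset = content.split('charset=', 1)[1].split(';')[0].strip()

sniffer = CharsetSniffer()
sniffer.feed('<html><head><meta charset="utf-8"></head><body></body></html>')
print(sniffer.charset)  # utf-8
```

A real sniffer (like the one inside BeautifulSoup) also has to cope with byte-order marks and with decoding the bytes far enough to read the tags in the first place, which is why delegating to the library is the safer choice.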

Alternatively, test explicitly for the charset parameter being present and only re-encode (via response.text and io.open() or otherwise) if requests did not fall back to the Latin-1 default. See retrieve links from web page using python and BeautifulSoup for an answer where I use such a method to tell BeautifulSoup what codec to use.
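A minimal sketch of such a check, parsing only the Content-Type header value with the stdlib email.message parser (the helper name declared_charset is mine):

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_param('charset')

# Only trust response.text when the server actually declared a codec;
# otherwise keep working with the raw bytes in response.content:
# charset = declared_charset(page.headers.get('Content-Type', ''))
print(declared_charset('text/html; charset=utf-8'))  # utf-8
print(declared_charset('text/html'))                 # None
```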

Upvotes: 2

mel

Reputation: 2790

I solved it by encoding the text explicitly before writing:

outfile.write(page.text.encode('utf8', 'replace'))

I found the relevant documentation here: unicode problem
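The 'replace' error handler substitutes a placeholder for any character the target codec cannot represent, so the write never raises. A small illustration (UTF-8 can encode every character, so 'replace' only kicks in for narrower codecs such as ASCII):

```python
text = u'café'

# UTF-8 can represent every character; no replacement happens:
print(text.encode('utf8', 'replace'))   # b'caf\xc3\xa9'

# ASCII cannot represent 'é'; 'replace' substitutes '?':
print(text.encode('ascii', 'replace'))  # b'caf?'
```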

Upvotes: 0
