jason

Reputation: 4449

Chinese Unicode issue?

From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31

<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>

I'm scraping the text and trying to get 百度汇总,

but when I set r.encoding = 'utf-8' the result is �ٶȻ���,

and if I don't set an encoding at all, the result is °Ù¶È»ã×Ü.

Upvotes: 1

Views: 487

Answers (1)

Martijn Pieters

Reputation: 1124518

The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

GB2312 is a variable-width encoding, like UTF-8. The page lies, however: it actually uses GBK, an extension of GB2312.

You can decode it with GBK just fine:

>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True

Decoding with gb2312 fails:

>>> r.content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence

but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is declared.
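That superset relationship can be demonstrated without touching the network. Here is a minimal sketch; the character 镕, which is defined in GBK but absent from the older GB2312 table, stands in for the characters on the page that trip up the stricter codec:

```python
# '镕' (U+9555) exists in GBK but not in GB2312, so its encoded bytes
# form an illegal sequence for the stricter gb2312 codec.
raw = '百度汇总镕'.encode('gbk')

# GBK decodes every valid GB2312 sequence plus its own extensions:
text = raw.decode('gbk')
print('百度汇总' in text)   # True

# The stricter codec chokes on the GBK-only character:
try:
    raw.decode('gb2312')
except UnicodeDecodeError as exc:
    print('gb2312 failed at byte position', exc.start)
```

The same asymmetry explains the traceback above: every GB2312 page decodes cleanly as GBK, but not the other way around.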

If you are using requests, then setting r.encoding to gb2312 appears to work because r.text handles decode errors with errors='replace':

content = str(self.content, encoding, errors='replace')

so the decoding error when using GB2312 is masked, but only for codepoints defined solely in GBK; those come out as replacement characters.
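A small self-contained sketch of that masking behaviour (no network required; the GBK-only character 镕 again stands in for the codepoints GB2312 cannot represent):

```python
# Simulate what requests' r.text does when r.encoding = 'gb2312':
# decode with errors='replace', so GBK-only bytes silently become U+FFFD
# instead of raising UnicodeDecodeError.
content = '百度汇总镕'.encode('gbk')   # '镕' exists in GBK but not GB2312

masked = str(content, 'gb2312', errors='replace')

print('百度汇总' in masked)   # True, the GB2312-compatible part survives
print('\ufffd' in masked)    # True, the GBK-only character is mangled
```

This is why the mojibake in the question only affects some characters: everything the declared codec can represent decodes fine, and the rest is quietly replaced.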

Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:

>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The warning is caused by the GBK codepoints being used while the page claims to use GB2312.

Upvotes: 3
