Reputation: 6849
How can a response's apparent_encoding attribute be incorrect?
The code snippet below demonstrates my question:
import requests
url = "https://item.jd.com/100000177760.html"
r = requests.get(url)
print(r.status_code, r.encoding) # 200, gbk (correct)
print(r.apparent_encoding) # GB2312 (wrong)
How does requests determine the response's character encoding?
Upvotes: 2
Views: 3446
Reputation: 55844
Requests extracts the encoding from the charset parameter of the response's Content-Type header. If the header has no charset parameter and the content type is a "text" type, ISO-8859-1 (latin-1) is assumed. Otherwise r.encoding is left as None, and the response's apparent_encoding property is evaluated and used to decode r.text.
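The header rule described above can be sketched with the standard library alone. This is only an illustration of the rule, not the code requests actually runs; the function name encoding_from_content_type is hypothetical.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Hypothetical helper mirroring the header rule described above."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter wins.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    # No charset, but a text/* type: fall back to latin-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    # Nothing declared: the caller would have to guess from the body.
    return None


print(encoding_from_content_type("text/html; charset=gbk"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("application/octet-stream"))
```

A result of None corresponds to the case where the encoding has to be guessed from the body instead.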
apparent_encoding is determined by using the chardet library to guess the encoding of the response body.
In the case of the URL in the question, the encoding is declared in the Content-Type header:
>>> r.headers['Content-Type']
'text/html; charset=gbk'
so r.apparent_encoding is not evaluated until it is explicitly accessed by executing print(r.apparent_encoding).
In this particular case, chardet seems to get it wrong: the response's text attribute can be encoded with the gbk codec, but not with GB2312.
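One reason such a guess is plausible yet wrong: GBK is a superset of GB2312, so text that uses only GB2312 characters produces identical bytes under both codecs, and a single character outside GB2312 is enough to make the narrower codec fail. Here is a minimal sketch using Python's built-in codecs. The helper name can_encode and the sample characters are illustrative assumptions, not taken from the page in the question.

```python
def can_encode(text, codec):
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "龙"   # simplified form, present in GB2312
traditional = "龍"  # traditional form, present in GBK but not in GB2312

# Characters shared by both codecs encode to the same bytes.
print(simplified.encode("gb2312") == simplified.encode("gbk"))

# Only GBK can represent the traditional character.
print(can_encode(traditional, "gbk"), can_encode(traditional, "gb2312"))
```

When the header declares gbk, as it does here, trusting r.encoding over r.apparent_encoding avoids the problem.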
Upvotes: 4
Reputation: 15691
The requests library can use the HTTP headers set on the response to figure out the response's encoding.
In your example:
url = "https://item.jd.com/100000177760.html"
r = requests.get(url)
print(r.headers)
with result:
{
"Date": "Sat, 26 Oct 2019 05:24:58 GMT",
"Content-Type": "text/html; charset=gbk",
"Content-Length": "42964",
"Connection": "keep-alive",
#...
}
Here you can see charset=gbk in the Content-Type header.
Upvotes: 0
Reputation: 26924
The requests library uses the chardet library to guess which charset a piece of text appears to be encoded in.
You can find more in the chardet documentation.
Upvotes: 0