user7693832

Reputation: 6849

How does requests determine the encoding of a response?

How can a response's apparent_encoding attribute be incorrect?

The code snippet below demonstrates my question:

import requests

url = "https://item.jd.com/100000177760.html"
r = requests.get(url)
print(r.status_code, r.encoding)  # 200, gbk (correct)
print(r.apparent_encoding)  # GB2312 (wrong)

How does requests determine the response's character encoding?

Upvotes: 2

Views: 3446

Answers (3)

snakecharmerb

Reputation: 55844

Requests takes the encoding from the charset parameter of the response's Content-Type header. If the header has no charset and the content type contains "text", ISO-8859-1 (latin-1) is assumed. Otherwise r.encoding is left as None, and the response's apparent_encoding property is evaluated and used to decode r.text.
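The header lookup described above can be sketched with the standard library alone. This is a simplified illustration, not the code requests actually runs: the function name guess_encoding_from_content_type is hypothetical, and the "text" check here tests only the main type of the header value.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical helper: pick an encoding from a Content-Type value."""
    if not content_type:
        return None

    # Let the stdlib parse the header and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        # An explicit charset always wins.
        return charset.strip("'\"")

    if msg.get_content_maintype() == "text":
        # No charset on a text type: fall back to latin-1.
        return "ISO-8859-1"

    # Nothing declared; the caller would have to guess from the body.
    return None


print(guess_encoding_from_content_type("text/html; charset=gbk"))  # gbk
print(guess_encoding_from_content_type("text/html"))  # ISO-8859-1
print(guess_encoding_from_content_type("application/octet-stream"))  # None
```

When the function returns None, the body has to be inspected to guess an encoding, which is the role apparent_encoding plays.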

apparent_encoding is determined by using the chardet library (or charset_normalizer in newer releases of requests) to guess the encoding of the response body.

In the case of the URL in the question, the encoding is declared in the Content-Type header:

>>> r.headers['Content-Type']
'text/html; charset=gbk'

so r.apparent_encoding is not evaluated until it is explicitly accessed by executing print(r.apparent_encoding).

In this particular case, chardet seems to get it wrong: the response's text attribute can be encoded with the gbk codec, but not with GB2312.
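The difference between the two codecs can be checked without any network access. The sketch below assumes two illustrative sample strings that are not taken from the page in the question: GBK extends GB2312, so text that uses only GB2312 characters encodes identically in both, while a character outside GB2312 (here the traditional character 龍) encodes only with gbk.

```python
def can_encode(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


sample_common = "中文"    # characters present in both GB2312 and GBK
sample_extended = "龍"    # traditional character, present in GBK only

print(can_encode(sample_common, "gb2312"), can_encode(sample_common, "gbk"))      # True True
print(can_encode(sample_extended, "gb2312"), can_encode(sample_extended, "gbk"))  # False True

# For the shared characters, both codecs produce the same bytes,
# which is why a detector can plausibly report GB2312 for GBK content.
print(sample_common.encode("gb2312") == sample_common.encode("gbk"))  # True
```

A page that is mostly GB2312-compatible but contains a few GBK-only characters would therefore decode correctly as gbk and fail as GB2312.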

Upvotes: 4

Henry Woody

Reputation: 15691

The requests library can use the HTTP headers set on the response to figure out the response's encoding.

In your example:

url = "https://item.jd.com/100000177760.html"
r = requests.get(url)
print(r.headers)

with result:

{
    "Date": "Sat, 26 Oct 2019 05:24:58 GMT",
    "Content-Type": "text/html; charset=gbk",
    "Content-Length": "42964",
    "Connection": "keep-alive",
    #...
}

Here you can see charset=gbk in the Content-Type header.

Upvotes: 0

aircraft

Reputation: 26924

Python's requests uses the chardet library to guess which charset a piece of text appears to be encoded in.

You can find more details in the chardet documentation.

Upvotes: 0
