nosheyakku

Reputation: 346

Why doesn't the encoding change?

I fetch information from a page, but the text comes back in an encoding I can't work with:

response = session.post(
    url=uri,
    headers={
        'Accept-Charset': 'utf-8'
    }
)

error_message = re.search(r'b-content__red\"\>(.+?)\<', response.text)

Text:

&#x41E;&#x431;&#x440;&#x430;&#x442;&#x438;&#x442;&#x435;&#x441;&#x44C; &#x432; &#x441;&#x43B;&#x443;&#x436;&#x431;&#x443; &#x43F;&#x43E;&#x434;&#x434;&#x435;&#x440;&#x436;&#x43A;&#x438; &#x432;&#x430;&#x448;&#x435;&#x433;&#x43E; &#x431;&#x430;&#x43D;&#x43A;&#x430;.

Then I try to convert it:

import cchardet


if error_message:
    error_message = error_message.group(1).encode()
    encoding = cchardet.detect(error_message)['encoding']

    if 'UTF-8' != encoding.upper():
        error_message = error_message.decode('utf-8')

But the result is still the same. What am I doing wrong?

Upvotes: 2

Views: 74

Answers (1)

Axe319

Reputation: 4365

What you need is the built-in html module, specifically html.unescape().

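import html
import re

# session and uri are the same objects used in the question.
response = session.post(
    url=uri,
    headers={
        'Accept-Charset': 'utf-8'
    }
)

# Capture the text inside the element whose class is b-content__red.
error_message = re.search(r'b-content__red">(.+?)<', response.text)
if error_message:
    # Replace character references such as &#x41E; with the characters they name.
    error_message = html.unescape(error_message.group(1))
    print(error_message)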

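The problem with your approach is that the text is not in a different byte encoding at all. It is plain ASCII made up of HTML numeric character references (&#x41E; stands for О, and so on). .encode() converts a str to bytes, as in 'Обратитесь в службу поддержки вашего банка'.encode(), and .decode() converts a bytes object back to a str, as in b'\xd0\x9e\xd0\xb1\xd1\x80\xd0\xb0\xd1\x82\xd0\xb8\xd1\x82\xd0\xb5\xd1\x81\xd1\x8c'.decode('utf8'). Encoding the entity text and then decoding it again returns the same characters you started with, so the result does not change.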

Luckily, Python provides an easy way to decode HTML entities.
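To see the difference without making a request, here is a minimal sketch that uses the first word of the text from the question. The variable names (raw, round_tripped, decoded) are only illustrative and do not come from the original code.

```python
import html

# First word of the text from the question: plain ASCII made of
# numeric character references, not text in another byte encoding.
raw = "&#x41E;&#x431;&#x440;&#x430;&#x442;&#x438;&#x442;&#x435;&#x441;&#x44C;"

# Encoding to bytes and decoding again is a round trip,
# so the references come back unchanged.
round_tripped = raw.encode("utf-8").decode("utf-8")

# html.unescape() replaces each reference with the character it names.
decoded = html.unescape(raw)

print(round_tripped == raw)  # True
print(decoded)               # Обратитесь
```

The round trip leaves the string as it was, which is why the attempt in the question appeared to do nothing, while html.unescape() yields the readable Cyrillic text.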

Upvotes: 1
