Reputation: 346
I fetch some text from a page, but it comes back in an encoding I can't use:
response = session.post(
    url=uri,
    headers={
        'Accept-Charset': 'utf-8'
    }
)
error_message = re.search(r'b-content__red\"\>(.+?)\<', response.text)
Text:
Обратитесь в службу поддержки вашего банка.
Then I try to convert it:
import cchardet
if error_message:
    error_message = error_message.group(1).encode()
    encoding = cchardet.detect(error_message)['encoding']
    if 'UTF-8' != encoding.upper():
        error_message = error_message.decode('utf-8')
But the result is still the same. What am I doing wrong?
Upvotes: 2
Views: 74
Reputation: 4365
What you need is the built-in html module.
import html
import re

response = session.post(
    url=uri,
    headers={
        'Accept-Charset': 'utf-8'
    }
)
error_message = re.search(r'b-content__red\"\>(.+?)\<', response.text)
if error_message:
    # Turn HTML entities in the captured text into real characters.
    error_message = html.unescape(error_message.group(1))
    print(error_message)
The problem with your approach is that .encode() expects a str, as in 'Обратитесь в службу поддержки вашего банка'.encode(), and .decode() expects a bytes object, as in b'\xd0\x9e\xd0\xb1\xd1\x80\xd0\xb0\xd1\x82\xd0\xb8\xd1\x82\xd0\xb5\xd1\x81\xd1\x8c'.decode('utf8'). Encoding to UTF-8 and decoding back simply returns the original string, so any HTML entities in it are left untouched.
Luckily, Python provides an easy way to decode HTML entities.
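To make the difference concrete, here is a minimal sketch. It assumes the page delivers the message as numeric character references; the sample text and the names `text` and `escaped` are made up for illustration and are not taken from the real response.

```python
import html

# Hypothetical sample: build the entity-escaped form a page might return.
text = 'Обратитесь'
escaped = ''.join(f'&#{ord(ch)};' for ch in text)

# An encode/decode round trip only changes the byte representation and back,
# so the entities survive unchanged.
round_tripped = escaped.encode('utf-8').decode('utf-8')
print(round_tripped == escaped)

# html.unescape converts the entities into the characters they stand for.
decoded = html.unescape(escaped)
print(decoded == text)
```

Both comparisons should print True: the round trip is a no-op on the entities, while html.unescape recovers the readable text.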
Upvotes: 1