Denis Stepanov

Reputation: 33

Python output encoding

The code is below. It runs under Python 2 on Debian 9.

# -*- coding: utf-8 -*- 
import requests
import bs4

# repairing invalid HTML
s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ
print [typ]

It outputs this:

ТеÑ
нÑкÑм (ÑÑилиÑе)
[u'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)']
  1. Why does the printed value of the variable differ from how it appears inside a list?
  2. How can I get the correct value from the web page:

Технікум (училище)

In an interactive Python session, it can be recovered from the backslash-escaped bytes like this:

>>> print '\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)'.decode('utf8')
Технікум (училище)

Upvotes: 1

Views: 91

Answers (2)

Martijn Pieters

Reputation: 1121266

You made the mistake of trusting the character set declared by the server in the HTTP headers, which is what response.text relies on. That attribute gives you Unicode text decoded from the binary response data using the header information, which here is wrong. You then pass the Unicode string to BeautifulSoup, which assumes that it was decoded correctly.

Instead, use the response.content attribute, which gives you the raw, undecoded response body as a byte string:

tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

Now the data remains a binary string and BeautifulSoup will do the decoding for you, based on information in the HTML document itself (there’s a <meta> tag with the correct codec information in there):

>>> import requests, bs4
>>> s = requests.get('http://vstup.info/2017/i2017i483.html')
>>> tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
>>> bs = bs4.BeautifulSoup(tmp, "html.parser")
>>> content = bs.select("div#okrArea table#about tr")
>>> typ = content[1].findAll("td")[1].get_text()
>>> print typ
Технікум (училище)
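
The mis-decoding described above can be reproduced without any network access. The following is only a sketch: it assumes the wrong codec applied by response.text was latin-1 (which matches the u'\xd0\xa2...' value shown in the question), and it uses the byte string from the question in place of a real response body. It also shows why print typ and print [typ] look different: a list prints the repr() of its elements, so non-ASCII characters appear as escapes.

```python
# -*- coding: utf-8 -*-
# Stand-in for the response body: the UTF-8 bytes quoted in the question.
raw = (b"\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc "
       b"(\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)")

# Decoding with the wrong codec (assumed here to be latin-1) turns every
# byte into its own character, which is the garbled text from the question.
wrong = raw.decode("latin-1")

# Decoding with the codec the document was actually written in
# gives the intended text.
right = raw.decode("utf-8")

# repr() is what a list uses for its elements, hence the escapes.
print(repr(wrong))
print(right)
```

Because wrong has one character per byte, it is longer than right, where each Cyrillic letter was encoded as two bytes.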

Upvotes: 3

Rakesh

Reputation: 82755

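Encode the text as latin1 before printing. This turns the wrongly decoded string back into the original UTF-8 bytes.

Example: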

import requests
import bs4

s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ.encode("latin1")

Output:

Технікум (училище)
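
As a sketch of why this works: latin-1 maps code points 0 to 255 onto single bytes one-to-one, so encoding the garbled text with it recovers the original bytes, which can then be decoded as UTF-8. The helper name fix_mojibake is hypothetical, and the approach assumes the text really was mis-decoded as latin-1; it raises an error or gives wrong results if another codec (for example cp1252) was involved.

```python
# -*- coding: utf-8 -*-
def fix_mojibake(text):
    """Undo a wrong latin-1 decode of UTF-8 data (hypothetical helper)."""
    # Step 1: latin-1 encoding restores the original byte values.
    # Step 2: those bytes are decoded with the codec they were written in.
    return text.encode("latin-1").decode("utf-8")

# First word of the garbled value from the question.
garbled = u"\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc"
fixed = fix_mojibake(garbled)
print(fixed)
```

Unlike print typ.encode("latin1"), which hands raw bytes to the terminal, this returns a Unicode string that can be processed further.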

Upvotes: 2
