Reputation: 33
The code is below. It runs under Python 2 on Debian 9.
# -*- coding: utf-8 -*-
import requests
import bs4
# repairing invalid HTML
s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
bs = bs4.BeautifulSoup(tmp, "html.parser")
content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")
typ = content[1].findAll("td")[1].get_text() #ZVO type
print typ
print [typ]
It outputs this:
ТеÑ
нÑкÑм (ÑÑилиÑе)
[u'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)']
Expected output:
Технікум (училище)
In an interactive Python session, the correct text can be recovered from the escaped bytes like this:
>>> print '\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)'.decode('utf8')
Технікум (училище)
Upvotes: 1
Views: 91
Reputation: 1121266
You made the mistake of trusting the character set that the server declares for the HTTP content, by using response.text. This gives you Unicode text decoded from the binary response data using the header information, which here is wrong. You then give the Unicode string to BeautifulSoup, which assumes that it was correctly decoded.
Instead, use the response.content attribute, which gives you the raw binary string content body:
tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
Now the data remains a binary string and BeautifulSoup will do the decoding for you, based on information in the HTML document itself (there’s a <meta> tag with the correct codec information in there):
>>> import requests, bs4
>>> s = requests.get('http://vstup.info/2017/i2017i483.html')
>>> tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
>>> bs = bs4.BeautifulSoup(tmp, "html.parser")
>>> content = bs.select("div#okrArea table#about tr")
>>> typ = content[1].findAll("td")[1].get_text()
>>> print typ
Технікум (училище)
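To see the difference without hitting the network, here is a minimal sketch that decodes the same bytes both ways. The byte string is the one from the question's output; the variable names raw, wrong and right are only for illustration, and using "latin-1" as the wrong codec is an assumption based on how the garbled string in the question looks.

```python
# UTF-8 bytes as they would arrive in response.content
# (copied from the escaped output shown in the question).
raw = (b'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc'
       b' (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)')

# Decoding with the wrong codec (assumed here to be Latin-1) maps every
# single byte to its own character, which is what the garbled text is.
wrong = raw.decode("latin-1")

# Decoding with the codec the page really uses gives the intended text.
right = raw.decode("utf-8")

print(len(wrong))  # 33: one character per byte
print(len(right))  # 18: each Cyrillic letter took two bytes
```

The wrong decoding is not lossy, which is why the bytes can still be recovered afterwards, but it is simpler to never decode with the wrong codec in the first place.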
Upvotes: 3
Reputation: 82755
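Encode the text as latin1 before printing. This works because response.text was decoded as Latin-1 here, so encoding it back restores the original UTF-8 bytes.
Example: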
import requests
import bs4
s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
bs = bs4.BeautifulSoup(tmp, "html.parser")
content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")
typ = content[1].findAll("td")[1].get_text() #ZVO type
print typ.encode("latin1")
Output:
Технікум (училище)
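If you want a proper Unicode string back rather than raw bytes, the same round trip can be wrapped in a small helper. This is only a sketch: fix_mojibake is a made-up name, not part of requests or bs4, and it assumes the text was UTF-8 wrongly decoded as Latin-1. When that assumption does not hold, it returns the input unchanged.

```python
def fix_mojibake(text):
    """Undo a wrong Latin-1 decoding of UTF-8 data (hypothetical helper).

    Returns the input unchanged if it cannot be encoded as Latin-1
    or if the resulting bytes are not valid UTF-8.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled value from the question, as returned by get_text().
garbled = (u'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc'
           u' (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)')

fixed = fix_mojibake(garbled)
print(len(garbled))  # 33
print(len(fixed))    # 18
```

Text that is already correct Cyrillic cannot be encoded as Latin-1, so it passes through untouched.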
Upvotes: 2