Reputation: 849
I am using Anaconda Python 3.7 Jupyter Notebook with the requests module to scrape some video game data from a website.
The game "Brütal Legend" has an umlaut and appears correctly on the website I am scraping from, but when I get the data via the requests module, the special character is no longer intact. For example, this is what I get:
BrÃ¼tal Legend
Here is what my code looks like:
import requests
targetURL = 'https://www.url.com/redacted.php?query'
r = requests.get(targetURL)
page_source = r.text
print("raw page_source", page_source)
What can I do to preserve the special character so that it shows up correctly in the output of my Jupyter Notebook?
Upvotes: 2
Views: 3935
Reputation: 3107
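You need to know the charset, which is declared in the response's Content-Type header. Most websites use UTF-8, but response.text does not assume it: requests decodes the body with response.encoding, which it takes from that header. When the header declares a text type without a charset, requests falls back to ISO-8859-1; when response.encoding is None, it guesses the encoding from the body instead.
Note: many sites don't declare a charset even though they serve UTF-8.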
http://docs.python-requests.org/en/master/api/?highlight=encod#requests.Response.encoding
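To make the header lookup concrete, here is a minimal standard-library sketch of pulling the charset out of a Content-Type value. The helper name charset_from_content_type is hypothetical, and this is not how requests does it internally; it only illustrates where the charset lives.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset declared in a Content-Type value, or None.

    Hypothetical helper for illustration; it is not part of requests.
    """
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

With the question's code, the value to pass in would be r.headers.get("Content-Type", "").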
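You got BrÃ¼tal Legend because the wrong encoding was used to convert the bytes into a string: the page is most likely UTF-8, but it was decoded as ISO-8859-1. Try r.content.decode("utf-8"), or set r.encoding = "utf-8" before reading r.text.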
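When a site does not declare its charset, one possible approach is to try a strict UTF-8 decode first and only fall back to ISO-8859-1 if that fails. The sketch below assumes the raw bytes come from r.content; the function name decode_bytes is made up for this example.

```python
def decode_bytes(raw):
    """Decode raw bytes as UTF-8, falling back to ISO-8859-1.

    Hypothetical helper: a strict UTF-8 decode rejects most non-UTF-8 input,
    while ISO-8859-1 accepts any byte sequence, so it is a safe last resort.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# Bytes as a UTF-8 page would send them, and as an ISO-8859-1 page would.
utf8_bytes = "Brütal Legend".encode("utf-8")
latin1_bytes = "Brütal Legend".encode("iso-8859-1")

print(decode_bytes(utf8_bytes))
print(decode_bytes(latin1_bytes))
```

In the question's code this would replace page_source = r.text with page_source = decode_bytes(r.content).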
A simple example:
import requests
with requests.Session() as s:
    utf_8 = s.get("https://en.wikipedia.org/wiki/Br%C3%BCtal_Legend")
    # The response charset is UTF-8
    print(utf_8.text[101:126])
    print(utf_8.content.decode("utf8")[101:126])
    print(utf_8.content[101:127].decode("ISO-8859-1"))
Output:
Brütal Legend - Wikipedia
Brütal Legend - Wikipedia
BrÃ¼tal Legend - Wikipedia
Edit:
print("BrÃ¼tal Legend".encode("ISO-8859-1").decode())
# Brütal Legend
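If the text has already been decoded with the wrong encoding, the same round trip can be wrapped in a small guard so that correctly decoded strings pass through unchanged. This is only a sketch; repair_mojibake is a hypothetical name, and it handles only the specific case of UTF-8 bytes misread as ISO-8859-1.

```python
def repair_mojibake(text):
    """Undo a UTF-8 body that was mistakenly decoded as ISO-8859-1.

    Hypothetical helper: returns the text unchanged when the round trip
    is not possible, which is the case for already-correct strings.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the garbled string by decoding UTF-8 bytes with the wrong codec.
garbled = "Brütal Legend".encode("utf-8").decode("iso-8859-1")

print(repair_mojibake(garbled))
print(repair_mojibake("Brütal Legend"))
```

Decoding the bytes correctly in the first place is still preferable; a repair step like this is a fallback for text you can no longer re-fetch.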
Upvotes: 2