Chris Nielsen

Reputation: 849

Preserving special characters when using requests module

I am using Anaconda Python 3.7 Jupyter Notebook with the requests module to scrape some video game data from a website.

The game "Brütal Legend" has an umlaut and appears correctly on the website I am scraping, but when I get the data via the requests module, the special character is no longer intact. For example, this is what I get:

Brütal Legend

Here is what my code looks like:

import requests

targetURL = 'https://www.url.com/redacted.php?query'
r = requests.get(targetURL)
page_source = r.text
print("raw page_source", page_source)

What can I do to preserve the special character so that it shows up correctly in the output of my Jupyter Notebook?

Upvotes: 2

Views: 3935

Answers (1)

KC.

Reputation: 3107

You need to know the charset, which is declared in the response's Content-Type header. requests uses that charset to decode response.text. If the header has a text type but no charset, requests falls back to ISO-8859-1, and only when Response.encoding is None does it guess the encoding from the bytes themselves.

Note: Many sites do not declare a charset, even though they serve UTF-8.

http://docs.python-requests.org/en/master/api/?highlight=encod#requests.Response.encoding

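So the reason you got Brütal Legend is most likely that the wrong encoding was used to convert the bytes into a string: the page's UTF-8 bytes were decoded as ISO-8859-1. You should try r.content.decode("utf-8"), or set r.encoding = "utf-8" before reading r.text.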
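
As a rough sketch of that idea, the helper below decodes a response body by trusting a declared charset and otherwise trying UTF-8 first. The name decode_body and the SimpleNamespace stand-in are hypothetical, for illustration only; the helper assumes only that the object exposes .headers, .content and .encoding the way requests.Response does.

```python
from types import SimpleNamespace


def decode_body(resp, fallback="utf-8"):
    """Return the body of a requests-style response as text (hypothetical helper)."""
    content_type = resp.headers.get("Content-Type", "")
    if "charset=" in content_type.lower() and resp.encoding:
        # The server declared a charset, so trust it.
        return resp.content.decode(resp.encoding, errors="replace")
    # No charset declared: try the fallback (UTF-8) before anything else.
    try:
        return resp.content.decode(fallback)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return resp.content.decode("iso-8859-1")


# Stand-in for a response whose header has no charset (assumed scenario).
fake = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    content="Brütal Legend".encode("utf-8"),
    encoding="ISO-8859-1",
)
print(decode_body(fake))  # Brütal Legend
```

With the code from the question, page_source = decode_body(r) would take the place of page_source = r.text.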

A simple example:

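import requests

with requests.Session() as s:
    utf_8 = s.get("https://en.wikipedia.org/wiki/Br%C3%BCtal_Legend")
    # The response declares charset=UTF-8, so .text decodes correctly
    print(utf_8.text[101:126])
    print(utf_8.content.decode("utf8")[101:126])

    # Decoding the same bytes as ISO-8859-1 reproduces the garbled text
    # (one byte more in the slice, because "ü" takes two bytes in UTF-8)
    print(utf_8.content[101:127].decode("ISO-8859-1"))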

Output:

Brütal Legend - Wikipedia
Brütal Legend - Wikipedia
Brütal Legend - Wikipedia

Edit:

print("Brütal Legend".encode("ISO-8859-1").decode())
#Brütal Legend
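
The round trip can be checked without any network access. This is a small sketch using only built-in string methods; the variable names are illustrative.

```python
original = "Brütal Legend"

# What the server sends: the text encoded as UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec produces the garbled text.
mojibake = raw.decode("iso-8859-1")
print(mojibake)  # Brütal Legend

# Reversing the wrong step and decoding as UTF-8 recovers the original.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Brütal Legend
```

This only repairs text that was mis-decoded exactly once as ISO-8859-1; decoding the bytes correctly in the first place is the more reliable fix.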

Upvotes: 2
