Reputation: 1431
I need help with a pretty simple Python 3.6 script.
First, it downloads an HTML file from an old-fashioned server which uses cp1251 encoding.
Then I need to put the file contents into a UTF-8 encoded string.
Here is what I'm doing:
import requests
import codecs
#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')
#checking that it's in cp1251
print(ri.encoding)
#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')
#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')
print(text)
Here is the error:
Traceback (most recent call last):
File "main.py", line 15, in <module>
text = codecs.decode(text,'utf-8')
File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte
I'd really appreciate any help with it.
Upvotes: 0
Views: 3532
Reputation: 338406
Not sure what you are trying to do.
.text is the text of the response, a Python string. Encodings don't play any role in Python strings.
Encodings only play a role when you have a stream of bytes that you want to convert to a string (or the other way around). And the requests module already does that for you.
import requests
ri = requests.get('http://old.moluch.ru/_python_test/0.html')
print(ri.text)
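To make the bytes-versus-string distinction concrete, here is a minimal offline sketch. The sample string is an assumption for illustration, not the content of the page in the question.

```python
# A Python str holds Unicode characters and has no encoding of its own.
sample = 'Привет'  # hypothetical sample text

# Encoding turns the str into bytes; the bytes depend on the codec chosen.
as_cp1251 = sample.encode('cp1251')  # one byte per Cyrillic letter
as_utf8 = sample.encode('utf-8')     # two bytes per Cyrillic letter

print(len(sample), len(as_cp1251), len(as_utf8))

# Decoding with the matching codec recovers the same str either way.
assert as_cp1251.decode('cp1251') == sample
assert as_utf8.decode('utf-8') == sample
```

Both byte sequences represent the same text; only the decoded str is what your program should work with.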
For example, assume you have a text file (i.e.: bytes). Then you must pick an encoding when you open() the file - the choice of encoding determines how the bytes in the file are converted into characters. This manual step is necessary because open() cannot know what encoding the bytes of the file are in.
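As a rough illustration of that point, the sketch below writes cp1251 bytes to a temporary file and reads them back with two different encodings. The sample text and the use of a temporary file are assumptions made only for the demo.

```python
import os
import tempfile

sample = 'Привет'  # hypothetical sample text

# Store the text on disk as cp1251 bytes, like the old server would.
with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as tmp:
    tmp.write(sample.encode('cp1251'))
    path = tmp.name

try:
    # Correct encoding: the bytes become the original characters.
    with open(path, encoding='cp1251') as f:
        right = f.read()
    # Wrong encoding: latin-1 never fails, it just yields the wrong characters.
    with open(path, encoding='latin-1') as f:
        wrong = f.read()
finally:
    os.remove(path)

print(right == sample, wrong == sample)
```

The file is identical in both reads; only the encoding passed to open() changes what string you get.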
HTTP, on the other hand, sends this in the response headers (Content-Type), so requests can know this information. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you used the much more low-level urllib, you'd have to do your own decoding.)
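If you did have to decode by hand, the steps might look roughly like the sketch below. It is not how requests is implemented; the helper name charset_from_content_type, the utf-8 fallback, and the sample header and body are all assumptions for illustration. It runs offline on hard-coded bytes.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper; falls back to `default` (an assumption) when
    the header carries no charset.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset() or default


# Stand-ins for what a low-level HTTP client would hand you.
content_type = 'text/html; charset=windows-1251'
raw_body = '<p>Привет</p>'.encode('cp1251')

charset = charset_from_content_type(content_type)
text = raw_body.decode(charset)  # bytes -> str, using the declared charset
print(charset, text)
```

Python accepts windows-1251 as an alias for cp1251, so the name from the header can be passed straight to decode().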
The .encoding property is mostly informational when you use the .text of the response: it names the codec requests used to decode the body, and you would only assign to it to override a wrong guess. It might be relevant if you use the .raw property, though. For work with servers that return regular text responses, using .raw is seldom necessary.
Upvotes: 2
Reputation: 2600
You don't need to do the encoding/decoding.
"When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text"
So this will work:
import requests
#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')
text = ri.text
print(text)
You can also access the response body as bytes, for non-text responses:
ri.content
Please check out the requests documentation
Upvotes: 1
Reputation: 1827
Many people have already explained that requests.get gives you text that is already decoded, so I will address the error you are facing right now.
This line:
text = codecs.encode(text,'cp1251')
encodes the text into cp1251 bytes. You are then trying to decode those bytes as UTF-8, which raises the error here:
text = codecs.decode(text,'utf-8')
To detect the encoding of a byte string, you can use chardet:
import codecs
import chardet
# text is the str taken from ri.text
cp1251_bytes = codecs.encode(text, 'cp1251')
chardet.detect(cp1251_bytes)  # output: {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}
# OR
utf8_bytes = codecs.encode(text, 'utf-8')
chardet.detect(utf8_bytes)  # output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
So encoding in one format and then decoding in another causes the error.
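The mismatch can be reproduced offline with a short sketch. The sample string stands in for ri.text and is an assumption; the real page content will differ.

```python
text = 'Привет'  # hypothetical stand-in for ri.text

# Encoding to cp1251 and decoding as UTF-8 mixes two codecs and fails.
cp1251_bytes = text.encode('cp1251')
error = None
try:
    cp1251_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print('decode failed:', exc)

# If UTF-8 bytes are what you need (e.g. to write a file),
# encode the str directly with utf-8 instead.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes.decode('utf-8') == text)
```

In other words, the str from requests needs no re-decoding; a single encode('utf-8') is enough when bytes are required.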
Upvotes: 1
Reputation: 185
You can simply suppress the error by passing errors='ignore' to the decode function:
text = codecs.decode(text,'utf-8',errors='ignore')
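Before relying on this, it may be worth checking what errors='ignore' does to the data. The sketch below uses an assumed sample string; with it, every Cyrillic byte is invalid as UTF-8 and gets dropped without any warning.

```python
import codecs

original = '<p>Привет</p>'  # hypothetical sample text
cp1251_bytes = codecs.encode(original, 'cp1251')

# Invalid UTF-8 bytes are discarded instead of raising an exception.
lossy = codecs.decode(cp1251_bytes, 'utf-8', errors='ignore')

print(repr(lossy))
print('Привет' in lossy)
```

The exception goes away, but so does the non-ASCII text, so this hides the problem rather than converting anything.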
Upvotes: -1