Reputation: 15772
I am using Python 2.6 CGI scripts, but found this error in the server log while doing json.dumps():
Traceback (most recent call last):
  File "/etc/mongodb/server/cgi-bin/getstats.py", line 135, in <module>
    print json.dumps(__getdata())
  File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
Here, the __getdata() function returns a dictionary.
Before posting this question, I read this question on SO. How do I resolve this error?
Upvotes: 435
Views: 2195352
Reputation: 724
The problem is simple: some non-ASCII text was encoded to bytes with a different encoding from the one you are using to decode it. (Of course, if you don't have any "special chars", the charset doesn't matter much.)
Example:
my_text = "Temp in °"
my_encoded_text = bytes(my_text,'iso-8859-1')
my_encoded_text.decode('utf-8')
This will throw the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 8: invalid start byte
whereas if you use the same charset to decode it
my_text = "Temp in °"
my_encoded_text = bytes(my_text,'iso-8859-1')
my_encoded_text.decode('iso-8859-1')
'Temp in °'
If you don't know the charset that was used to perform the encoding, and you're dealing with a file, then you can use chardet (install it first, wink wink).
import chardet
file_name = 'the_file_you_want_to_read.csv'
with open(file_name, 'rb') as f:
    result = chardet.detect(f.read())
detected_charset = result['encoding']
You can use the detected_charset to decode the file.
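For completeness, a minimal sketch of that last step; the raw bytes and the charset are hard-coded here in place of a real file read and chardet's actual output:

```python
# Bytes that arrived in an unknown charset; in real code this would be
# f.read(), and the charset would come from chardet.detect().
raw = 'Temp in °'.encode('iso-8859-1')
detected_charset = 'ISO-8859-1'  # what chardet reported
text = raw.decode(detected_charset)
print(text)  # Temp in °
```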
Upvotes: 2
Reputation: 23439
There are a lot of answers here that suggest using one encoding or another to make the error go away. I think you really shouldn't do that. For example, if you're trying to load a CSV file into memory using pandas, then encodings like latin1 or unicode_escape will get rid of the error, but they will produce gibberish for the rows that actually triggered it, and you will silently lose data.
If you get this error and you absolutely know the problem is related to encoding, then the solution is to figure out the correct encoding: datasets are usually accompanied by a metadata dictionary, webpages declare their encoding in their headers, or you can just ask the people who prepared the data.
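As a hedged illustration of the webpage case: the charset is typically declared in the Content-Type header, and the standard library can parse it out (the header value and body bytes below are made up for the example):

```python
from email.message import Message

# A typical Content-Type header value, as served by a web server.
msg = Message()
msg['Content-Type'] = 'text/html; charset=ISO-8859-1'
charset = msg.get_content_charset()  # parsed and lowercased: 'iso-8859-1'

body = 'Temp in °'.encode('iso-8859-1')  # pretend this is the response body
text = body.decode(charset)
print(text)  # Temp in °
```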
However, the error might also signal that what you're attempting isn't supposed to be done at all. For example, if you read an image file as a bytes object and then try to decode it, it will throw the error in the title, but realistically you're never supposed to do that:
with open("myimage.png", "rb") as f:
    data = f.read()
data.decode()  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
You probably want to read the image into a numeric array in the first place, so the solution is to use a dedicated image reader module instead and convert the data into an array without the middleman step of decoding a bytes object.
from PIL import Image
import numpy as np
data = np.array(Image.open("myimage.png"))
Upvotes: 0
Reputation: 4604
If you get this error when trying to read a csv file, the read_csv() function from pandas lets you set the encoding:
import pandas as pd
data = pd.read_csv(filename, encoding='unicode_escape')
Upvotes: 426
Reputation: 8998
By default, the open function uses mode 'r', as in read-only text. This can be set to 'rb', as in read binary.
Try the below code snippet:
with open(path, 'rb') as f:
    text = f.read()
Upvotes: 195
Reputation: 6086
I know this doesn't directly fit the question, but I repeatedly get directed here when I google the error message.
I got the error when I mistakenly tried to install a Python package as if it were a requirements file, i.e., with -r:
# wrong: leads to the error above
pip install -r my_package.whl
# correct: without -r
pip install my_package.whl
I hope this helps others who made the same little mistake as I did without noticing.
Upvotes: 0
Reputation: 41
Simple solution:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
If it's not working, try changing the engine to 'c'. (Note: the valid read_csv() engines are 'c', 'python', and, on newer pandas versions, 'pyarrow'; 'python-fwf' is internal to read_fwf().)
Upvotes: 1
Reputation: 21
I encountered the same error while trying to import an Excel sheet on SharePoint into a pandas dataframe. My solution was using engine='openpyxl'. I'm also using requests_negotiate_sspi to avoid storing passwords in plain text.
import pandas as pd
import requests
from io import BytesIO
from requests_negotiate_sspi import HttpNegotiateAuth
cert = r'c:\path_to\saved_certificate.cer'
target_file_url = r'https://share.companydomain.com/sites/Sitename/folder/excel_file.xlsx'
response = requests.get(target_file_url, auth=HttpNegotiateAuth(), verify=cert)
df = pd.read_excel(BytesIO(response.content), engine='openpyxl', sheet_name='Sheet1')
Upvotes: 1
Reputation: 129
The following snippet worked for me.
import pandas as pd
df = pd.read_csv(filename, sep=';', encoding='latin1', error_bad_lines=False)  # error_bad_lines=False skips lines that would raise a parsing error
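Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent is on_bad_lines='skip'. A small sketch with an in-memory CSV:

```python
import io
import pandas as pd

# The second data row has an extra field; on_bad_lines='skip' drops it
# instead of raising a ParserError.
csv_data = io.StringIO('a;b\n1;2\n3;4;5\n6;7\n')
df = pd.read_csv(csv_data, sep=';', on_bad_lines='skip')
print(len(df))  # the two well-formed rows remain
```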
Upvotes: 10
Reputation: 2370
In my case, I had to save the file as UTF-8 with BOM, not just as UTF-8; then this error was gone.
Upvotes: 0
Reputation: 21
After trying all the aforementioned workarounds, if it still throws the same error, you can try exporting the file as CSV (a second time if you already have). Especially if you're using scikit-learn, it is best to import the dataset as a CSV file.
I spent hours on this, whereas the solution was this simple: export the file as a CSV to the directory where Anaconda or your classifier tools are installed, and try again.
Upvotes: 2
Reputation: 171
If the above methods are not working for you, you may want to look into changing the encoding of the csv file itself.
Using Excel:
1. Open the csv file using Excel.
2. Save it with the CSV (Comma delimited) (*.csv) file type, selecting the Unicode (UTF-8) option from the "Save this document as" drop-down list.
Using Notepad:
1. Open the csv file using Notepad.
2. Save it with the .csv extension, selecting the UTF-8 option under Encoding.
By doing this, you should be able to import csv files without encountering the UnicodeDecodeError.
Upvotes: 17
Reputation: 142528
Instead of looking for ways to decode a5 (Yen ¥) or 96 (en-dash –), tell MySQL that your client is encoded "latin1", but you want "utf8" in the database.
See details in Trouble with UTF-8 characters; what I see is not what I stored
Upvotes: 1
Reputation: 5489
This solution worked for me:
import pandas as pd
data = pd.read_csv("training.csv", encoding = 'unicode_escape')
Upvotes: 40
Reputation: 701
Your string has a non-ASCII character encoded in it.
Not being able to decode with utf-8 may happen if you've needed to use other encodings in your code. For example:
>>> 'my weird character \x96'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte
In this case, the encoding is windows-1252, so you have to do:
>>> 'my weird character \x96'.decode('windows-1252')
u'my weird character \u2013'
Now that you have Unicode, you can safely encode into utf-8.
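The example above is Python 2; in Python 3 the same round trip (bytes → str → UTF-8 bytes) looks like this:

```python
# 0x96 is an en-dash in windows-1252. Decode the raw bytes with the
# charset they were actually written in, then re-encode as UTF-8.
raw = b'my weird character \x96'
text = raw.decode('windows-1252')  # 'my weird character –'
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # b'my weird character \xe2\x80\x93'
```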
Upvotes: 56
Reputation: 1161
You may use any standard encoding suited to your specific usage and input.
utf-8 is the default.
iso8859-1 is also popular for Western Europe.
e.g.: bytes_obj.decode('iso8859-1')
See the codecs documentation for the full list of standard encodings.
Upvotes: 4
Reputation: 8071
Inspired by @aaronpenne and @Soumyaansh
with open("file.txt", "rb") as f:
    text = f.read().decode(errors='replace')
Upvotes: 29
Reputation: 600
As of 2018-05, this is handled directly with decode, at least for Python 3.
I'm using the below snippet for invalid start byte and invalid continuation byte type errors. Adding errors='ignore' fixed it for me.
with open(out_file, 'rb') as f:
    for line in f:
        print(line.decode(errors='ignore'))
Upvotes: 17
Reputation: 419
On read_csv, I added an encoding argument:
import pandas as pd
dataset = pd.read_csv('sample_data.csv', header=0, encoding='unicode_escape')
Upvotes: 41
Reputation: 14003
Simple Solution:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
Upvotes: 17
Reputation: 3145
The error occurs because there is some non-ASCII character in the dictionary that can't be encoded/decoded. One simple way to avoid the error is to encode such strings with the encode() function as follows (if a is the string with the non-ASCII character):
a.encode('utf-8').strip()
Upvotes: 121
Reputation: 9850
Set the default encoder at the top of your code (Python 2 only; reload(sys) and sys.setdefaultencoding() do not exist in Python 3):
import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")
Upvotes: 20
Reputation: 15772
The following line was hurting the JSON encoder:
now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now})  # this is the culprit
I got a temporary fix for it:
print json.dumps({'old_time': now.encode('ISO-8859-1').strip()})
Marking this as correct as a temporary fix (not sure about it).
Upvotes: 8