Reputation: 11423
Here is my code:
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to solve this by adding an extra parameter to open(). The code looks like:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?
Upvotes: 379
Views: 1054581
Reputation: 39
Replace the encoding with encoding='ISO-8859-1':
for line in open('u.item', encoding='ISO-8859-1'):
    # print(line)
Upvotes: 3
Reputation: 347
I was using a dataset downloaded from Kaggle; while reading this dataset it threw this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte
So this is how I fixed it.
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')
Upvotes: 9
Reputation: 533
Just open the CSV file and save it as 'CSV UTF-8 (Comma delimited) (*.csv)'. You will find this in the list of Save As file options.
Once the file is saved and closed, import the data:
data = pd.read_csv('file_name.csv')
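If you prefer to do the conversion in code rather than through Excel's Save As dialog, a minimal sketch (assuming the source file is ISO-8859-1; the file names are placeholders) would be:
import pandas as pd

# Read with the original encoding, then write back out as UTF-8
data = pd.read_csv('file_name.csv', encoding='ISO-8859-1')
data.to_csv('file_name_utf8.csv', index=False, encoding='utf-8')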
Upvotes: 0
Reputation: 134
My issue was similar in that UTF-8 text was getting passed to the Python script.
In my case, it was from SQL using the sp_execute_external_script in the Machine Learning service for SQL Server. For whatever reason, VARCHAR data appears to get passed as UTF-8, whereas NVARCHAR data gets passed as UTF-16.
Since there's no way to specify the default encoding in Python, and no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in my SELECT query in the @input_data parameter.
So, while this query
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));
gives the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data
Using CONVERT(type, data) (CAST(data AS type) would also work)
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));
returns
id text
1 Ç
Upvotes: 0
Reputation: 91
In my case, this issue occurred because I had changed the extension of an Excel file (.xlsx) directly to a (.csv) file...
The solution was to open the file and save it as a new (.csv) file (i.e. File -> Save As -> select the (.csv) extension and save it). This worked for me.
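If you would rather do that conversion programmatically, a rough pandas sketch (assuming the openpyxl engine is installed; the file names are placeholders) would be:
import pandas as pd

# Actually convert the workbook instead of just renaming the extension
df = pd.read_excel('data.xlsx')
df.to_csv('data.csv', index=False, encoding='utf-8')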
Upvotes: 0
Reputation: 5694
I keep coming across this error, and often it is not resolved by encoding='utf-8' but rather by engine='python', like this:
import pandas as pd
file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df
A link to the docs is here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Upvotes: 1
Reputation: 41
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte
The above error occurs due to the encoding.
Solution: Use encoding='latin-1'
Reference: https://pandas.pydata.org/docs/search.html?q=encoding
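For example, a minimal pandas sketch (the file name here is just a placeholder):
import pandas as pd

# 'latin-1' maps every possible byte value to a character, so decoding never fails
df = pd.read_csv('data.csv', encoding='latin-1')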
Upvotes: 4
Reputation: 1
Use this if you are directly loading data from GitHub or Kaggle:
DF = pd.read_csv(file, encoding='ISO-8859-1')
Upvotes: 0
Reputation: 305
Based on another question on Stack Overflow and previous answers in this post, I would like to add some help to find the right encoding.
If your script runs on a Linux OS, you can get the encoding with the file command:
file --mime-encoding <filename>
Here is a python script to do that for you:
import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command."""
    # Find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None
    file_cmd = which_run.stdout.decode().replace('\n', '')

    # Run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # Return the encoding name only
    return file_run.stdout.decode().split()[1]

# Test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
Upvotes: 6
Reputation: 1057
The following also worked for me. ISO-8859-1 will save you a lot of trouble, especially if you are using Speech Recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Upvotes: 92
Reputation: 11
To make this page easier to find through a Google search on a similar question (about an error with UTF-8), I'm leaving my solution here for others.
I had a problem opening a .csv file, with this description:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte
I opened the file with Notepad and counted to the 150th position: it was a Cyrillic symbol. I re-saved the file using the 'Save as..' command with encoding 'UTF-8', and my program started to work.
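If you cannot fix the file by hand, a rough Python equivalent is to read it with the real encoding and write it back out as UTF-8 (the cp1251 encoding and the file names below are assumptions):
# cp1251 is a common Cyrillic encoding; adjust to whatever the file really uses
with open('data.csv', encoding='cp1251') as src:
    text = src.read()
with open('data_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)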
Upvotes: 1
Reputation: 55
Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the current encoding, or to convert it from ANSI to UTF-8 or to the ISO 8859-1 code page.
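If you want to identify the encoding programmatically instead of through an editor, one option (assuming the third-party chardet package is installed) is:
import chardet

# Guess the encoding from the raw bytes of the file
with open('u.item', 'rb') as f:
    guess = chardet.detect(f.read())
print(guess)  # a dict with 'encoding', 'confidence' and 'language' keys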
Upvotes: 2
Reputation: 3170
You can try this way:
open('u.item', encoding='utf8', errors='ignore')
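Note that errors='ignore' silently drops the bytes it cannot decode. If you would rather keep a visible placeholder, errors='replace' is an alternative:
# Undecodable bytes become the U+FFFD replacement character instead of disappearing
for line in open('u.item', encoding='utf8', errors='replace'):
    print(line)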
Upvotes: 8
Reputation: 345
This works:
open('filename', encoding='latin-1')
Or:
open('filename', encoding="ISO-8859-1")
Upvotes: 23
Reputation: 199
You could resolve the problem with:
for line in open(your_file_path, 'rb'):
'rb' opens the file in binary mode, so each line is returned as bytes without any decoding.
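Keep in mind that you still have to decode those bytes yourself at some point. A small sketch (the encoding is an assumption):
for raw_line in open(your_file_path, 'rb'):
    # Decode the bytes once you know the real encoding
    line = raw_line.decode('ISO-8859-1')
    print(line)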
Upvotes: 18
Reputation: 2121
Sometimes calling open(filepath) when filepath is not actually a file can produce the same error, so first make sure the file you're trying to open exists:
import os
assert os.path.isfile(filepath)
Upvotes: 2
Reputation: 11423
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open("u.item", encoding="ISO-8859-1") will solve the problem.
Upvotes: 670
Reputation: 8521
If you are using Python 2, the following will be the solution:
import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
Because the encoding parameter doesn't work with Python 2's built-in open(), you will get the following error:
TypeError: 'encoding' is an invalid keyword argument for this function
Upvotes: 18
Reputation: 533
Try this to read using Pandas:
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Upvotes: 29
Reputation: 51
This is an example of converting a CSV file in Python 3:
import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',', quotechar='"')
except IOError:
    pass
Upvotes: 5
Reputation: 308101
Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the 0xe9 would be the character é.
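A quick way to sanity-check this in an interactive session (a sketch, not part of the original answer):
>>> b'\xe9'.decode('windows-1252')
'é'
>>> b'\xe9'.decode('latin-1')
'é'
>>> b'\xe9'.decode('utf-8')  # raises UnicodeDecodeError: 0xe9 starts a multi-byte sequence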
Upvotes: 42