oscarcapote

Reputation: 477

'utf-8' codec can't decode byte reading a file in Python3.4 but not in Python2.7

I was reading a file in Python 2.7, and it was read perfectly. When I run the same program in Python 3.4, the following error appears:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program on Windows (with Python 3.4), the error does not appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

The code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

Upvotes: 10

Views: 21451

Answers (3)

dyomas

Reputation: 720

In my case I can't change the encoding because my file really is UTF-8 encoded. But some rows are corrupted and cause the same error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte

My solution is to open the file in binary mode:

open(filename, 'rb')
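
Binary mode alone returns bytes, so the rows still have to be decoded somewhere. A minimal sketch of one way to do that, assuming the goal is to keep every row and only mark the corrupted bytes (the name read_lines_tolerant is hypothetical):

```python
def read_lines_tolerant(filename, encoding='utf-8'):
    """Yield decoded lines, substituting U+FFFD for bytes that cannot be decoded."""
    with open(filename, 'rb') as f:  # binary mode: nothing is decoded on read
        for raw_line in f:
            # errors='replace' keeps the row and marks each bad byte with U+FFFD
            yield raw_line.decode(encoding, errors='replace')
```

Passing errors='replace' directly to a text-mode open() should give a similar result; the binary variant is useful when the raw bytes of a bad row also need to be inspected.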

Upvotes: 2

oscarcapote

Reputation: 477

OK, I did what @unutbu told me. The result was a lot of candidate encodings, one of which was cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my files.
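
For reference, a sketch of the whole function with the encoding passed through. The encoding parameter and its cp1250 default are additions for illustration, and a with block replaces the explicit close:

```python
def lectdict(filename, colkey, colvalue, encoding='cp1250'):
    """Map each value in column `colkey` to the list of `colvalue` entries seen for it."""
    D = {}
    with open(filename, 'r', encoding=encoding) as f:
        for line in f:
            if line == '\n':
                continue
            # Unlike the original, strip the trailing newline so the last column is clean
            fields = line.rstrip('\n').split(';')
            D.setdefault(fields[colkey], []).append(fields[colvalue])
    return D
```

Splitting each line once and using setdefault avoids the three repeated split calls, but the grouping logic is otherwise the same.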

Upvotes: 3

unutbu

Reputation: 879471

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.

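In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects: bytes decoded according to some encoding. When no encoding is given, open() uses locale.getpreferredencoding(False), which is commonly utf-8 on Linux and a legacy code page such as cp1252 on Windows. That is why the same program may raise no error on Windows.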
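
To see which encoding open() falls back to on a particular machine, a quick check along these lines can help. It reflects the Python 3 versions discussed here, where the fallback depends on the locale:

```python
import locale
import sys

# Locale-dependent encoding that open() uses for text files
# when no encoding= argument is given
print(locale.getpreferredencoding(False))

# Encoding used by str.encode() and bytes.decode() when none is given;
# this one is always 'utf-8' in Python 3
print(sys.getdefaultencoding())
```

If the first value differs between two machines, the same open(filename, 'r') call decodes the file differently on each.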

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
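
The failure is easy to reproduce on a few made-up bytes. In UTF-8, 0xf2 announces a four-byte sequence, so the byte after it must be a continuation byte (0x80 to 0xbf); an ordinary ASCII letter is not. The sample data below is invented for illustration:

```python
raw = b'abc\xf2def'  # made-up bytes; 0xf2 is followed by plain ASCII

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)  # same kind of message as in the question

# Single-byte encodings accept the same bytes, but map 0xf2 to different characters
print(raw.decode('latin-1'))
print(raw.decode('cp1250'))
```

Both single-byte decodes succeed, which is why a successful decode alone does not prove the encoding is the right one.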

To fix the problem you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:

If you do not know the correct encoding, you could run this program to simply try all the encodings known to Python. If you are lucky there will be an encoding which turns the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
candidates = all_encodings()  # separate name, so the encodings module is not shadowed
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
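
Since the error message already gives the offset of the offending byte (424 in the question), it can be quicker to decode only the bytes around that offset with a handful of likely encodings. A sketch, where show_context, the candidate list, and the sample bytes are all hypothetical:

```python
def show_context(raw, position, candidates, width=20):
    """Decode the bytes around `position` with each candidate encoding."""
    start = max(0, position - width)
    chunk = raw[start:position + width]
    results = {}
    for enc in candidates:
        try:
            results[enc] = chunk.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # chunk is not decodable with this encoding, or the name is unknown
            continue
    return results

# Made-up data; for a real file, read the bytes with open(filename, 'rb')
raw = b'Codi;Codi_lloc_anonim;Nom\n1;x;Bot\xf2nica\n'
found = show_context(raw, raw.index(b'\xf2'), ['utf-8', 'latin-1', 'cp1250'])
for enc, text in found.items():
    print(enc, repr(text))
```

Because the slice can cut a character in half, this is only reliable for single-byte encodings; a multi-byte candidate may be rejected just because of where the chunk starts or ends.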

Upvotes: 22
