Eden Crow

Reputation: 16204

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:

Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>

After reading this Q&A, see How to determine the encoding of text if you need help figuring out the encoding of the file you are trying to open.

Upvotes: 1132

Views: 1967255

Answers (16)

Sergei

Reputation: 315

This check helped me solve the issue:

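import csv
import chardet  # third-party package: pip install chardet

# Guess the encoding from the first 10,000 bytes of the file
with open(input_file, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
encoding = result['encoding']

print(f"Detected encoding: {encoding}")

# Decode with the detected encoding; undecodable bytes become U+FFFD
with open(input_file, 'r', newline='', encoding=encoding, errors='replace') as csvfile:
    reader = csv.reader(csvfile)
    # read the file...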

Upvotes: 0

rha

Reputation: 749

TLDR: Try: file = open(filename, encoding='cp437')

Why? When one uses:

file = open(filename)
text = file.read()

Python assumes the file uses the same code page as the current environment (cp1252 in the case of the opening post) and tries to decode its bytes into Unicode text (str) on that basis. If the file contains byte values that are not defined in this code page (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may not be handled by Python (e.g. cp790), and sometimes the file can contain mixed encodings.
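A minimal sketch of that failure, using only the standard library. It lists the byte values that Python's cp1252 codec leaves undefined and then reproduces the error from the question on a small byte string. The variable names are illustrative and not part of the original post.

```python
# Collect every single-byte value that the cp1252 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])  # 0x90 is among them

# Decoding data that contains 0x90 raises the error from the question.
try:
    b"abc\x90def".decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

Calling open(filename) without an encoding argument on a system whose default is cp1252 runs the same decoding step on the file's contents.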

If such characters are unneeded, one may decide to replace them with the Unicode replacement character (U+FFFD, usually displayed as a question mark in a diamond), with:

file = open(filename, errors='replace')

Another workaround is to use:

file = open(filename, errors='ignore')

The undecodable bytes are then silently dropped and the rest of the text is left intact, but other errors will be masked too.
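To make the difference between the two error handlers concrete, here is a small sketch on an in-memory byte string (the sample data is made up). The same errors argument behaves this way when passed to open().

```python
raw = b"caf\x90 au lait"  # 0x90 is undefined in cp1252

# 'replace' substitutes U+FFFD for each byte that cannot be decoded.
replaced = raw.decode("cp1252", errors="replace")

# 'ignore' drops the undecodable bytes without leaving any trace.
ignored = raw.decode("cp1252", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

With 'replace' the position of the damage stays visible in the text; with 'ignore' it is lost.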

A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which maps every single-byte value (0..255) to a character (like cp437 or latin1):

file = open(filename, encoding='cp437')

Codepage 437 is just an example. It is the original DOS encoding. All codes are mapped, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable) and one can check their ord() values.
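A short sketch of the "every byte is mapped" property, assuming nothing beyond the standard codecs. It decodes all 256 byte values with cp437 and latin1 and inspects what byte 0x90 turns into in each case.

```python
all_bytes = bytes(range(256))

# Both codecs assign a character to every byte value, so decoding never fails.
as_cp437 = all_bytes.decode("cp437")
as_latin1 = all_bytes.decode("latin1")

# latin1 maps byte N to code point N, so the original bytes can be recovered.
round_trip_ok = as_latin1.encode("latin1") == all_bytes

# The same byte becomes a different character depending on the codec.
print(repr(as_cp437[0x90]), ord(as_cp437[0x90]))
print(repr(as_latin1[0x90]), ord(as_latin1[0x90]))
print(round_trip_ok)
```

Neither result is necessarily the character the file's author intended; the point is only that reading succeeds and the byte values can still be inspected.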

Please note that this advice is just a quick workaround for a nasty problem. The proper solution is to use binary mode, although it is not as quick.

Upvotes: 61

navalega0109

Reputation: 400

The code below reads the file and decodes its contents as UTF-8.

with open("./website.html", encoding="utf8") as file:
    contents = file.read()

Upvotes: 9

Sayantam

Reputation: 964

If you are on Windows, the file may start with a UTF-8 BOM, indicating that it definitely is a UTF-8 file. As per https://bugs.python.org/issue44510, I used encoding="utf-8-sig", and the csv file was read successfully.
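A small sketch of what the utf-8-sig codec changes, using a made-up CSV header as sample data. With plain utf-8 the BOM is kept as an invisible first character, which typically ends up glued to the first column name.

```python
data = b"\xef\xbb\xbfname,value\r\n"  # UTF-8 BOM followed by a CSV header

with_bom = data.decode("utf-8")
without_bom = data.decode("utf-8-sig")

print(repr(with_bom[0]))     # the BOM survives as U+FEFF
print(repr(without_bom[0]))  # the BOM has been stripped
```

utf-8-sig also reads files that have no BOM, so it is a reasonable choice when the presence of a BOM is uncertain.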

Upvotes: 3

Lennart Regebro

Reputation: 172339

The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 is only an unprintable control code in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
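As an illustration of 0x90 appearing as a UTF-8 continuation byte, the sketch below writes three Cyrillic letters to a temporary file and reads them back twice. The sample text and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile

text = "\u0410\u0411\u0412"  # Cyrillic sample; U+0410 is encoded in UTF-8 as D0 90
raw = text.encode("utf8")

# Write the raw bytes to a temporary file.
fd, filename = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(raw)

try:
    # Reading with the right encoding returns the original text.
    with open(filename, encoding="utf8") as file:
        decoded = file.read()

    # Reading the same file as cp1252 fails on the 0x90 byte.
    try:
        with open(filename, encoding="cp1252") as file:
            file.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(filename)

print(raw, decoded == text, cp1252_failed)
```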

Upvotes: 1929

Hellena Crainicu

Reputation: 49

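This is an example of how I open and close files with UTF-8, taken from some recent code:

def traducere_v1_txt(translator, file):
    # base_path, input_lang and lxml1 are defined elsewhere in the original program
    data = []
    # Read the source file as UTF-8, skipping bytes that cannot be decoded
    with open(f"{base_path}/{file}", "r", encoding='utf8', errors='ignore') as open_file:
        data = open_file.readlines()

    # ... (code omitted in the original snippet) ...

    file_name = file.replace(".html", "")
    # Write the result back out as UTF-8
    with open(f"Translated_Folder/{file_name}_{input_lang}.html", "w", encoding='utf8') as htmlfile:
        htmlfile.write(lxml1)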

Upvotes: 1

Just Me

Reputation: 1063

def read_files(file_path):

    with open(file_path, encoding='utf8') as f:
        text = f.read()
        return text

OR (AND)

def write_file(text, file_path):

    # 'wb' opens the file for writing bytes; the text is encoded by hand
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

OR

from docx import Document  # assumes the python-docx package

# file_path is assumed to be a pathlib.Path
document = Document()
document.add_heading(file_path.name, 0)
file_content = file_path.read_text(encoding='UTF-8')
document.add_paragraph(file_content)

OR

def read_text_from_file(cale_fisier):
    # cale_fisier is expected to be a pathlib.Path
    text = cale_fisier.read_text(encoding='UTF-8')
    print("What I read: ", text)
    return text # return the text that was read

def save_text_into_file(cale_fisier, text):
    # the with block closes the file once the text has been written
    with open(cale_fisier, "w", encoding='utf-8') as f:
        print("What I wrote: ", text)
        f.write(text) # write the content to the file

OR

def read_text_from_file(file_path):
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
        return text # return the text that was read


def write_to_file(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore')) # write the content to the file

OR

import os
import glob

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrite it in place with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()

    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)

Upvotes: 8

Piyush raj

Reputation: 29

For me, changing the MySQL character encoding to match the one used in my code solved the problem:

photo = open('pic3.png', encoding='latin1')

Upvotes: 1

Arthur MacMillan

Reputation: 101

In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).

Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
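One way to check whether the setting took effect is to ask the running interpreter, as in the sketch below. It only reads standard attributes; the variable names are illustrative.

```python
import locale
import sys

# Non-zero when UTF-8 mode is active (for example via -X utf8 or PYTHONUTF8=1).
utf8_mode = sys.flags.utf8_mode

# The encoding that text-mode open() falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)

print(f"UTF-8 mode: {utf8_mode}, default text encoding: {default_encoding}")
```

If this reports a code page such as cp1252, files opened without an explicit encoding argument will still be decoded with that code page.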

Upvotes: 4

Kyle Parisi

Reputation: 1416

Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:

open(filename, 'rb')

where r = reading, b = binary
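A minimal sketch of binary mode, assuming a throwaway temporary file with arbitrary bytes in it. Because nothing is decoded, the read cannot raise UnicodeDecodeError, and the result is a bytes object rather than a str.

```python
import os
import tempfile

payload = bytes([0x50, 0x90, 0x00, 0xFF])  # arbitrary bytes, including 0x90

# Create a temporary file holding the payload.
fd, filename = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(payload)

try:
    # 'rb' returns the file contents exactly as stored on disk.
    with open(filename, "rb") as f:
        data = f.read()
finally:
    os.remove(filename)

print(type(data), data)
```

If the text is needed later, data.decode(...) can be called explicitly once the encoding is known.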

Upvotes: 90

Declan Nnadozie

Reputation: 2067

If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)

Upvotes: 154

gabi939

Reputation: 107

For me, opening the file with the utf16 encoding worked:

file = open('filename.csv', encoding="utf16")

Upvotes: 4

hanna

Reputation: 655

Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090)

and then consider removing it from the file.
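The same check can be done locally. The sketch below finds where the byte occurs and asks the standard unicodedata module about U+0090; the raw value is a made-up stand-in for the contents of the real file read in binary mode.

```python
import unicodedata

raw = b"price: 10\x90 units"  # stand-in for open(filename, "rb").read()
bad_byte = 0x90

# Offsets of every occurrence of the offending byte, with some context.
offsets = [i for i, b in enumerate(raw) if b == bad_byte]
for i in offsets:
    print(i, raw[max(0, i - 10): i + 10])

# What U+0090 is in Unicode: a control character without a name.
ch = chr(bad_byte)
category = unicodedata.category(ch)
name = unicodedata.name(ch, "<no name>")
print(category, name)
```

Seeing the surrounding bytes often makes it clear whether the byte is stray garbage or part of a multi-byte character in another encoding.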

Upvotes: 5

E.Zolduoarrati

Reputation: 1659

Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:

open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')

Godspeed

Upvotes: 17

Matas Vaitkevicius

Reputation: 61489

As an extension to @LennartRegebro's answer:

If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding. They aren't perfect but usually work just fine. After you figure out the encoding you should be able to use the solution above.
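If an online tool is not an option, a rough local alternative is to try a short list of candidates in order. This is only a sketch: guess_encoding and its candidate list are hypothetical, and a successful decode does not prove the guess is right, since latin1 accepts any input.

```python
def guess_encoding(raw, candidates=("utf-8-sig", "cp1252", "latin1")):
    """Return the first candidate that decodes raw without errors, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_sample = "naïve café".encode("utf-8")
cp1252_sample = "é".encode("cp1252")
odd_sample = b"abc\x90"

print(guess_encoding(utf8_sample))
print(guess_encoding(cp1252_sample))
print(guess_encoding(odd_sample))
```

Strict encodings go first because they reject most data that is not theirs; permissive single-byte encodings go last as a fallback.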

EDIT: (Copied from comment)

Sublime Text, a quite popular text editor, has a command to display the encoding if it has been set...

  1. Go to View -> Show Console (or Ctrl+`)

  2. Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)

Upvotes: 44

Antoni

Reputation: 2622

For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.

Open the file in Notepad++. In the bottom right it will tell you the current file encoding. In the top menu, next to "View", locate "Encoding". In "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".

Upvotes: 3
