Reputation: 36030
I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? (The question How can I detect the encoding/codepage of a text file covers the same problem for C#.)
Upvotes: 323
Views: 484662
Reputation: 3609
Summary: when the input is random bytes and no text encoding is detected, you probably want hex_escape.
Compare some algorithms: guess-encoding-of-bytestring.py
#!/usr/bin/env python3
# guess text encoding of bytestring
# [cchardet]: https://github.com/PyYoshi/cChardet
# [faust-cchardet]: https://github.com/faust-streaming/cChardet
# [uchardet]: https://gitlab.freedesktop.org/uchardet/uchardet
# good for short strings
# fails on long strings
def guess_encoding_cchardet(bs: bytes):
return cchardet.detect(bs).get("encoding")
# [charset_normalizer]: https://github.com/jawah/charset_normalizer
# [charset_normalizer#566]: https://github.com/jawah/charset_normalizer/issues/566
# good for long strings
# fails on short strings
# https://github.com/jawah/charset_normalizer/issues/486
# 20x faster than chardet [charset_normalizer]
# -> 200x slower than cchardet
# 5x slower than cchardet [charset_normalizer#566]
# benchmark versus chardet
# https://github.com/jawah/charset_normalizer/raw/master/bin/performance.py
def guess_encoding_charset_normalizer(bs: bytes):
match = charset_normalizer.from_bytes(bs).best()
if match:
return match.encoding
return None
# [rs_chardet]: https://github.com/emattiza/rs_chardet
# 40x slower than cchardet [rs_chardet]
def guess_encoding_rs_chardet(bs: bytes):
return rs_chardet.detect_rs_enc_name(bs)
# return rs_chardet.detect_codec(bs).name
# [chardet]: https://github.com/chardet/chardet
# 4000x slower than cchardet [rs_chardet]
# 2000x slower than cchardet [cchardet]
def guess_encoding_chardet(bs: bytes):
return chardet.detect(bs).get("encoding")
# [magic]: https://github.com/ahupp/python-magic
# fails on short strings
def guess_encoding_magic(bs: bytes):
e = magic.detect_from_content(bs).encoding
if e in ("binary", "unknown-8bit"):
return None
return e
# [icu]: https://github.com/unicode-org/icu
# fails on short strings
def guess_encoding_icu(bs: bytes):
try:
return icu.CharsetDetector(bs).detect().getName()
except icu.ICUError:
return None
if __name__ == "__main__":
# test
import random
bytes_encoding_list = [
("ü".encode("latin1"), "latin1"),
("üü".encode("latin1"), "latin1"),
("üüü".encode("latin1"), "latin1"),
]
for _ in range(10):
bytes_encoding_list += [
(random.randbytes(20), None),
]
def test(guess_encoding):
global bytes_encoding_list
module_name = guess_encoding._name
for input_bytes, expected_encoding in bytes_encoding_list:
assert isinstance(input_bytes, bytes)
# TODO better...
guessed_encoding = guess_encoding(input_bytes)
actual_string = None
if guessed_encoding:
try:
actual_string = input_bytes.decode(guessed_encoding)
except Exception as exc:
if expected_encoding == None:
print(f"{module_name}: fail. found wrong encoding {guessed_encoding} in random bytes {input_bytes}")
continue
else:
print(f"{module_name}: FIXME failed to decode bytes: {exc}")
if expected_encoding == None:
# the guessed encoding can be anything -> dont compare encoding
if guessed_encoding == None:
print(f"{module_name}: ok. found no encoding in random bytes {input_bytes}")
else:
print(f"{module_name}: ok. found encoding {guessed_encoding} in random bytes {input_bytes} -> string {actual_string!r}")
else:
expected_string = input_bytes.decode(expected_encoding)
if actual_string == expected_string:
print(f"{module_name}: ok. decoded {actual_string} from {guessed_encoding} bytes {input_bytes}")
else:
#print(f"{module_name}: fail. actual {actual_string!r} from {guessed_encoding}. expected {expected_string!r} from {expected_encoding} bytes {input_bytes}")
print(f"{module_name}: fail. string: {actual_string!r} != {expected_string!r}. encoding: {guessed_encoding} != {expected_encoding}. bytes: {input_bytes}")
for k in list(globals().keys()):
if not k.startswith("guess_encoding_"):
continue
module_name = k[15:]
module_found = False
try:
module = __import__(module_name)
globals()[module_name] = module
module_found = True
except ModuleNotFoundError as exc:
print(f"{module_name}: module not found. hint: pip install {module_name}")
pass
if module_found:
guess_encoding = locals()[k]
guess_encoding._name = module_name
test(guess_encoding)
Upvotes: 0
Reputation: 8190
Some text files are aware of their encoding (for instance, files that start with a BOM or contain an explicit declaration, as XML files do), but most are not.
Some encodings are versatile, i.e., they can decode any sequence of bytes, and some are not. US-ASCII is not versatile, since any byte greater than 127 is not mapped to any character. UTF-8 is not versatile either, since not every sequence of bytes is valid.
On the contrary, Latin-1, Windows-1252, etc. are versatile (even if some bytes are not officially mapped to a character):
>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']
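Whereas UTF-8 rejects many byte sequences; a minimal illustration (the byte 0xff is chosen arbitrarily):
>>> b'\xff'.decode("utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte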
Given a random text file encoded in a sequence of bytes, you can't determine the encoding unless the file is aware of its encoding, because some encodings are versatile. But you can sometimes exclude non-versatile encodings; all versatile encodings remain possible. The chardet module uses the frequency of bytes to guess which encoding best fits the encoded text.
If you don't want to use this module or a similar one, here's a simple method: first look for a BOM; if there is none and every byte is below 128, treat the data as ASCII; otherwise try to decode it as UTF-8 and fall back to a versatile 8-bit encoding if that fails.
The UTF-8 step is a bit risky if you check only a sample, because some bytes in the rest of the file may be invalid.
The code:
import codecs

def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
"""
A basic encoding detector.
"""
for bom, encoding in [
(codecs.BOM_UTF32_BE, "utf_32_be"),
(codecs.BOM_UTF32_LE, "utf_32_le"),
(codecs.BOM_UTF16_BE, "utf_16_be"),
(codecs.BOM_UTF16_LE, "utf_16_le"),
(codecs.BOM_UTF8, "utf_8_sig"),
]:
if data.startswith(bom):
return encoding
if all(b < 128 for b in data):
return "ascii" # You may want to use the fallback here if data is only a sample.
decoder = codecs.getincrementaldecoder("utf_8")()
try:
decoder.decode(data, final=False)
except UnicodeDecodeError:
return fallback
else:
return "utf_8" # Not certain if data is only a sample
Remember that non-versatile encodings may fail. The errors parameter of the decode method can be set to 'ignore', 'replace' or 'backslashreplace' to avoid exceptions.
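For instance, a minimal usage sketch of the guess_encoding function above, combined with a lenient decode (the file name is hypothetical):
data = open("unknown.txt", "rb").read()  # hypothetical file name
encoding = guess_encoding(data)
text = data.decode(encoding, errors="replace")  # never raises; undecodable bytes become U+FFFD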
Upvotes: 0
Reputation: 36
You can use the chardet module:
import chardet

with open(filepath, "rb") as f:
    data = f.read()

detector = chardet.UniversalDetector()
detector.feed(data)
detector.close()
print(detector.result)
Or you can use the chardet3 command in Linux, but it takes some time:
chardet3 fileName
Example:
chardet3 donnee/dir/donnee.csv
donnee/dir/donnee.csv: ISO-8859-1 with confidence 0.73
Upvotes: 0
Reputation: 27626
This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: 8. How to guess the encoding of a document.
Here's an example. I'm on OS X.
#!/usr/bin/python
import sys
def isUTF8(data):
try:
decoded = data.decode('UTF-8')
except UnicodeDecodeError:
return False
else:
for ch in decoded:
if 0xD800 <= ord(ch) <= 0xDFFF:
return False
return True
def get_bytes_from_file(filename):
return open(filename, "rb").read()
filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)
PS /Users/js> ./isutf8.py hi.txt
True
Upvotes: 1
Reputation: 2013
You can use the python-magic package, which does not load the whole file into memory:
import magic

def detect(file_path):
    return magic.Magic(mime_encoding=True).from_file(file_path)
The output is an encoding name, for example utf-8 or iso-8859-1.
Upvotes: 0
Reputation: 5260
Using the Linux file -i command:
import re
import subprocess
file = "path/to/file/file.txt"
encoding = subprocess.Popen("file -bi "+file, shell=True, stdout=subprocess.PIPE).stdout
encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)
print(encoding)
Upvotes: 0
Reputation: 18988
Here is an example of reading and taking at face value a chardet encoding prediction, reading n_lines from the file in the event it is large. chardet also gives you a probability (i.e., confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned along with the prediction by chardet.detect(), so you could work that in somehow if you like.
import chardet
from pathlib import Path
def predict_encoding(file_path: Path, n_lines: int=20) -> str:
'''Predict a file's encoding using chardet'''
# Open the file as binary data
with Path(file_path).open('rb') as f:
# Join binary lines for specified number of lines
rawdata = b''.join([f.readline() for _ in range(n_lines)])
return chardet.detect(rawdata)['encoding']
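If you also want that confidence value, a small variation (a sketch, not part of the original function) can return both fields of the chardet.detect() result:
def predict_encoding_with_confidence(file_path: Path, n_lines: int = 20):
    '''Predict a file's encoding and also return chardet's confidence score.'''
    with Path(file_path).open('rb') as f:
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    result = chardet.detect(rawdata)
    return result['encoding'], result['confidence']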
Upvotes: 41
Reputation: 49
Depending on your platform, I just opt to use the Linux shell file command. This works for me since I am using it in a script that exclusively runs on one of our Linux machines.
Obviously, this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.
import subprocess

def is_utf8(filename):
    file_cmd = ['file', filename]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # cmd_output[0] begins with the file name, followed by the file type as reported by 'file'
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')

print(is_utf8('test.txt'))
Upvotes: 3
Reputation: 17097
Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.
The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob) # "utf-8", "us-ascii", etc.
There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
Upvotes: 104
Reputation: 11493
Some encoding strategies (please uncomment to taste):
#!/bin/bash
#
tmpfile=$1
echo '-- info about the file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile
You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:
# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # Add more
for e in encodings:
try:
fh = codecs.open('file.txt', 'r', encoding=e)
fh.readlines()
fh.seek(0)
except UnicodeDecodeError:
print('got Unicode error with %s, trying different encoding' % e)
else:
print('opening the file with encoding: %s ' % e)
break
Upvotes: 46
Reputation: 97
cchardet is a faster alternative to chardet.
Install: pip install cchardet
Use:
import cchardet as chardet
from pathlib import Path

filepath = Path(filename)
blob = filepath.read_bytes()
detection = chardet.detect(blob)
encoding = detection["encoding"]
confidence = detection["confidence"]
Upvotes: 1
Reputation: 43
I just want to add, for everyone's information, how to install the magic package for Python 3 with pip:
pip install python-magic
Upvotes: 0
Reputation: 192
A long time ago, I had this need.
Reading old code of mine, I found this:
import urllib.request
import chardet
import os
import settings
[...]
file = 'sources/dl/file.csv'
media_folder = settings.MEDIA_ROOT
file = os.path.join(media_folder, str(file))
if os.path.isfile(file):
file_2_test = urllib.request.urlopen('file://' + file).read()
encoding = (chardet.detect(file_2_test))['encoding']
return encoding
This worked for me and returned ascii.
Upvotes: 0
Reputation: 158
If you are not satisfied with the automatic tools, you can try all codecs manually and see which one is right.
all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437',
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125',
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13',
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7',
'utf_8', 'utf_8_sig']
def find_codec(text):
for i in all_codecs:
for j in all_codecs:
try:
print(i, "to", j, text.encode(i).decode(j))
except:
pass
find_codec("The example string which includes ö, ü, or ÄŸ, ö")
This script creates at least 9409 lines of output, so if the output cannot fit on the terminal screen, try writing it to a text file.
Upvotes: 13
Reputation: 223172
EDIT: chardet seems to be unmaintained, but most of the answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.
Correctly detecting the encoding is not possible in all cases.
(From the chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
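As a point of reference, a minimal chardet call looks roughly like this (the file name is just a placeholder):
import chardet

with open('unknown.txt', 'rb') as f:  # placeholder file name
    result = chardet.detect(f.read())
# result is a dict like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
print(result['encoding'], result['confidence'])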
You can also use UnicodeDammit. It will try the following methods:
- An encoding discovered in the document itself, for instance in an XML declaration or (for HTML documents) an http-equiv META tag.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8.
- Windows-1252.
Upvotes: 288
Reputation: 423
This might be helpful:
from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
content = file.read()
suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'
Upvotes: 22
Reputation: 6665
# Function: OpenRead(file)
# A text file can be encoded using:
# (1) The default operating system code page, Or
# (2) utf8 with a BOM header
#
# If a text file is encoded with utf8, and does not have a BOM header,
# the user can manually add a BOM header to the text file
# using a text editor such as notepad++, and rerun the python script,
# otherwise the file is read as a codepage file with the
# invalid codepage characters removed
import sys
if int(sys.version[0]) != 3:
print('Aborted: Python 3.x required')
sys.exit(1)
def bomType(file):
"""
returns file encoding string for open() function
EXAMPLE:
bom = bomtype(file)
open(file, encoding=bom, errors='ignore')
"""
f = open(file, 'rb')
b = f.read(4)
f.close()
    if b[0:3] == b'\xef\xbb\xbf':
        return "utf-8-sig"  # utf8 with a BOM; utf-8-sig strips the BOM when reading

    # Check utf32 before utf16: the utf-32 LE BOM (ff fe 00 00)
    # starts with the same bytes as the utf-16 LE BOM (ff fe)
    if (b[0:4] == b'\x00\x00\xfe\xff') or (b[0:4] == b'\xff\xfe\x00\x00'):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present
    # (write endianness is generally determined by the endianness of the CPU)
    if (b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe'):
        return "utf16"

    # If no BOM is provided, then assume it's the codepage
    # used by your operating system
    return "cp1252"
    # For the United States it's: cp1252
def OpenRead(file):
bom = bomType(file)
return open(file, 'r', encoding=bom, errors='ignore')
#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()
fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()
# this case is still treated like codepage cp1252
# (User responsible for making sure that all utf8 files
# have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there. barf(\x81\x8D\x90\x9D)")
fout.close()
# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()
# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline()
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()
# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline()
print(L)
fin.close()
# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
Upvotes: 6
Reputation: 7654
If you know some of the content of the file, you can try to decode it with several encodings and see which one fits. In general there is no way, since a text file is a text file and those are stupid ;)
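A compact sketch of that idea (the file name and the candidate list are arbitrary):
data = open("unknown.txt", "rb").read()  # hypothetical file
for enc in ("utf-8", "cp1252", "latin-1"):  # candidate encodings, pick your own
    try:
        print(enc, "->", data.decode(enc)[:40])
    except UnicodeDecodeError:
        print(enc, "-> failed")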
Upvotes: 1
Reputation: 127587
It is, in principle, impossible to determine the encoding of a text file, in the general case. So no, there is no standard Python library to do that for you.
If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
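For example, an XML file usually declares its encoding in the prolog; a small sketch (the function name and regex are illustrative, not a standard library API) could read that declaration directly:
import re

def xml_declared_encoding(data: bytes):
    # Look for <?xml version="1.0" encoding="..."?> at the start of the bytes.
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode('ascii') if match else None

# e.g. xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>') -> 'ISO-8859-1'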
Upvotes: 5