Asbjørn Ulsberg

Reputation: 8820

Convert a bunch of files from guessed encoding to UTF-8

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).

While chardet detects the encoding well enough and the script runs without errors, characters like © come out as $ after conversion. So I assume there's something wrong with the script or with my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling the problem is in the decoding (reading) part rather than the encoding (writing) part.

Can anyone tell me what I'm doing wrong? If switching to Python 3 solves this, I'm all for it; I'd just need help figuring out how to convert the script from 2.7 to 3.4. Here's the script:

import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector

# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

def enforce_unicode():
    detector = UniversalDetector()

    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)

            # Feed the detector line by line until it is confident enough.
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done:
                        break

            detector.close()
            encoding = detector.result['encoding']

            if encoding and encoding != 'UTF-8':
                print '%s -> UTF-8   %s' % (encoding.ljust(12), filepath)

                # Decode with the detected codec, then write back as UTF-8.
                with codecs.open(filepath, 'r', encoding=encoding) as f:
                    content = f.read()

                content = to_unicode_or_bust(content)

                with codecs.open(filepath, 'w', encoding='utf-8') as f:
                    f.write(content)

enforce_unicode()
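In case Python 3 is the way to go, here is my rough, untested stab at what the core loop would look like there (reading raw bytes for chardet and letting the io layer do the decoding):

import os
import fnmatch
from chardet.universaldetector import UniversalDetector

def enforce_unicode():
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)
            # chardet wants raw bytes, so read in binary mode.
            with open(filepath, 'rb') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done:
                        break
            detector.close()
            encoding = detector.result['encoding']
            if encoding and encoding != 'UTF-8':
                print('%s -> UTF-8   %s' % (encoding.ljust(12), filepath))
                # newline='' keeps the original line endings intact.
                with open(filepath, 'r', encoding=encoding, newline='') as f:
                    content = f.read()
                with open(filepath, 'w', encoding='utf-8', newline='') as f:
                    f.write(content)

enforce_unicode()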

On Python 2 I have also tried content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:

/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
  File "./enforce-unicode.py", line 48, in <module>
    enforce_unicode()
  File "./enforce-unicode.py", line 43, in enforce_unicode
    content = content.decode(encoding).encode('utf-8')
  File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)

Ideas?

Upvotes: 1

Views: 730

Answers (1)

Martijn Pieters

Reputation: 1121486

chardet simply got the detected codec wrong; your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses; it is not a foolproof method.

For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded in the one can be decoded without error using the other. Telling the difference between a control code in the one and a Euro symbol in the other usually takes a human being looking at the result.
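As a quick illustration, the very same byte decodes happily in both codecs, just to different characters:

>>> '\x80'.decode('cp1252')   # the Euro sign in Windows code page 1252
u'\u20ac'
>>> '\x80'.decode('latin-1')  # a control code in Latin-1
u'\x80'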

I'd record the chardet guess for each file; if a file turns out to be wrongly re-coded, you can then look at which other codecs are close to it. All of the 1250-series codepages look a lot alike.
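chardet also reports a confidence value next to each guess, so you could log both per file; a minimal sketch (the log_guesses helper name is mine), reusing the detector loop you already have:

import os
import fnmatch
from chardet.universaldetector import UniversalDetector

def log_guesses(topdir='.', pattern='*.cs'):
    # Print chardet's guessed codec and its confidence for each file,
    # so wrongly re-coded files can be traced back to the guess.
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk(topdir):
        for filename in fnmatch.filter(filenames, pattern):
            detector.reset()
            filepath = os.path.join(root, filename)
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done:
                        break
            detector.close()
            result = detector.result
            print '%-50s %-12s confidence %.2f' % (
                filepath, result['encoding'], result['confidence'])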

Upvotes: 2
