Mawg

Reputation: 40175

Convert file to Ascii is throwing exceptions

As a result of my previous question, I have coded this:

def ConvertFileToAscii(args, filePath):
    try:
        # Firstly, make sure that the file is writable by all, otherwise we can't update it
        os.chmod(filePath, 0o666)

        with open(filePath, "rb") as file:
            contentOfFile = file.read()

        unicodeData = contentOfFile.decode("utf-8")
        asciiData = unicodeData.encode("ascii", "ignore")

        asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')

        temporaryFile = tempfile.NamedTemporaryFile(mode='wt', delete=False)
        temporaryFileName = temporaryFile.name

        with open(temporaryFileName, 'wb')  as file:
            file.write(asciiData)

        if ((args.info) or (args.diagnostics)):
            print(filePath + ' converted to ASCII and stored in ' + temporaryFileName)


        return temporaryFileName

    #
    except KeyboardInterrupt:
        raise

    except Exception as e:
        print('!!!!!!!!!!!!!!!\nException while trying to convert ' + filePath + ' to ASCII')
        print(e)
        exc_type, exc_value, exc_traceback = sys.exc_info()
        print(traceback.format_exception(exc_type, exc_value, exc_traceback))

        if args.break_on_error:
            sys.exit('Break on error\n')

When I run it, I am getting exceptions like this:

['Traceback (most recent call last):
', '  File "/home/ker4hi/tools/xmlExpand/xmlExpand.py", line 99, in ConvertFileToAscii
    unicodeData = contentOfFile.decode("utf-8")
    ', "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1081: invalid start byte"]

What am I doing wrong?

I really don't care about data loss converting them to ASCII.

0x9C is Ü, a U with a diacritical mark (umlaut); I can live without it.

How can I convert such files to contain only pure ASCII characters? Do I really need to open them as binary and check every byte?

Upvotes: 3

Views: 227

Answers (4)

Raymond Hettinger

Reputation: 226516

I really don't care about data loss converting them to ASCII. ... How can I convert such files to contain only the pure Ascii characters?

One way is to use the replace option for the decode method. The advantage of replace over ignore is that you get placeholders for the missing values, which may help prevent a misinterpretation of the text.

Be sure to use ASCII encoding rather than UTF-8. Otherwise, you may lose adjacent ascii characters as the decoder attempts to re-sync.

Lastly, run encode('ascii') after the decoding step. Otherwise, you're left with a unicode string instead of a byte string.

>>> string_of_unknown_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_unknown_encoding.decode('ascii', 'replace')
>>> back_to_bytes = now_in_unicode.replace('\ufffd', '?').encode('ascii')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'

That said, TheRightWay™ to do this is to start caring about data loss and use the correct encoding (clearly your input isn't in UTF-8, otherwise the decoding wouldn't have failed):

>>> string_of_known_latin1_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_known_latin1_encoding.decode('latin-1')
>>> back_to_bytes = now_in_unicode.encode('ascii', 'replace')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'

Upvotes: 1

jfs

Reputation: 414625

You don't need to load the whole file into memory and call .decode() on it. open() has an encoding parameter (use io.open() on Python 2):

with open(filename, encoding='ascii', errors='ignore') as file:
    ascii_char = file.read(1)
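
As a rough sketch of how this could be applied to a whole file, the function below streams the input line by line into a new temporary file, dropping every non-ASCII byte along the way. The name convert_file_to_ascii is hypothetical (it is not from the question), and newline='' is an assumption made here so that line endings are copied through unchanged.

```python
import tempfile


def convert_file_to_ascii(src_path):
    """Copy src_path into a new temporary file, silently dropping non-ASCII bytes.

    Hypothetical helper: returns the name of the temporary file it created.
    """
    # errors='ignore' makes the ASCII decoder skip any byte >= 0x80.
    # newline='' disables newline translation on both sides.
    with open(src_path, encoding='ascii', errors='ignore', newline='') as src, \
            tempfile.NamedTemporaryFile(mode='w', encoding='ascii', newline='',
                                        delete=False) as dst:
        for line in src:
            dst.write(line)
        return dst.name
```

Because the file is read one line at a time, memory use stays small even for large inputs. The caller is responsible for deleting the temporary file, since it is created with delete=False.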

If you need an ASCII transliteration of Unicode text, consider unidecode.

Upvotes: 1

Ofir

Reputation: 8362

Use:

contentOfFile.decode('utf-8', 'ignore')

The exception is from the decode phase, where you didn't ignore the error.
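
A small illustration of what the error handler does, using a made-up sample value rather than the asker's file. Note that 'ignore' on the decode step only removes bytes that are invalid UTF-8; valid non-ASCII characters survive, so the encode step still needs its own 'ignore' to end up with ASCII-only bytes.

```python
raw = b'L\xf6wis'  # "Löwis" encoded as Latin-1; 0xf6 is an invalid UTF-8 start byte

ignored = raw.decode('utf-8', 'ignore')    # undecodable byte is dropped: 'Lwis'
replaced = raw.decode('utf-8', 'replace')  # undecodable byte becomes U+FFFD

# Second pass: drop any remaining non-ASCII characters while encoding.
ascii_bytes = ignored.encode('ascii', 'ignore')
print(ascii_bytes)  # b'Lwis'
```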

Upvotes: 1

lucasg

Reputation: 11012

The byte 0xf6 is ö (o-umlaut) in ISO-8859-1. My guess is that you're decoding with the wrong codec.

Try: unicodeData = contentOfFile.decode("ISO-8859-1")
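
If the file really is ISO-8859-1 (an assumption, since the question does not say), decoding it correctly lets the NFKD normalization from the question keep the base letter instead of losing the character altogether. A minimal sketch, where latin1_bytes_to_ascii is a hypothetical helper name:

```python
import unicodedata


def latin1_bytes_to_ascii(raw):
    """Decode ISO-8859-1 bytes and reduce them to ASCII, keeping base letters."""
    text = raw.decode('iso-8859-1')
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # The combining marks are not ASCII, so 'ignore' drops them.
    return decomposed.encode('ascii', 'ignore')


print(latin1_bytes_to_ascii(b'L\xf6wis'))  # b'Lowis'
```

Characters without a decomposition (ß, for example) are still dropped by the final encode step.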

Upvotes: 1
