johan

Reputation: 824

Check if bytes result in valid ISO 8859-15 (Latin) in Python

I want to test if a string of bytes that I'm extracting from a file results in valid ISO-8859-15 encoded text. The first thing I came across is this similar case about UTF-8 validation:

https://stackoverflow.com/a/5259160/1209004

So based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:

#! /usr/bin/env python
#

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    try:
        bytes.decode('iso-8859-15', 'strict')
        return(True)
    except UnicodeDecodeError:
        return(False)

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    isValidLatin = isValidISO885915(bytes)
    print(isValidLatin)

main()

However, running this returns True, even though byte 0x95 is not a valid code point in ISO-8859-15! Am I overlooking something really obvious here? (BTW, I tried this with Python 2.7.4 and 3.3; the results are identical in both cases.)

Upvotes: 2

Views: 1885

Answers (1)

johan

Reputation: 824

I think I've found a workable solution myself, so I might as well share it.

Looking at the code page layout of ISO 8859-15 (see here), I really only need to check for the presence of code points 00-1f and 7f-9f. These correspond to the C0 and C1 control codes (plus DEL at 7f).
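To illustrate why the strict decode in the question never fails: as far as I can tell, Python's iso-8859-15 codec maps every one of the 256 byte values to some code point, and bytes 80-9f simply come out as the corresponding C1 control characters. Here is a quick sketch to check this (the variable names are made up for the illustration):

```python
import unicodedata

# Decode every possible byte value; no UnicodeDecodeError is raised
allBytes = bytes(bytearray(range(256)))
decoded = allBytes.decode("iso-8859-15", "strict")

# Byte 0x95 decodes to the C1 control character U+0095 (category "Cc")
ch = b"\x95".decode("iso-8859-15")
print(hex(ord(ch)), unicodedata.category(ch))

# Collect the byte values whose decoded character is a control character
controlBytes = [i for i, c in enumerate(decoded)
                if unicodedata.category(c) == "Cc"]

# These are exactly the ranges 00-1f and 7f-9f
print(controlBytes == list(range(0x00, 0x20)) + list(range(0x7f, 0xa0)))
```

So the decode step on its own cannot flag byte 95; the control characters have to be detected separately.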

In my project I was already using something based on the code here for removing control characters from a string (C0 + C1). So, using that as a basis I came up with this:

#! /usr/bin/env python
#
import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string
    # Based on: https://stackoverflow.com/a/19016117/1209004

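    # Tab, newline and return are part of C0, but are allowed in XML
    allowedChars = [u'\t', u'\n', u'\r']
    # Only category "Cc" (the C0 and C1 controls) is removed. Testing the
    # whole "C" group would also strip U+00AD SOFT HYPHEN (category "Cf"),
    # which is a valid ISO-8859-15 character (byte 0xAD).
    return "".join(ch for ch in string if
        unicodedata.category(ch) != "Cc" or ch in allowedChars)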

def isValidISO885915(data):
    # Test if bytes result in valid ISO-8859-15 text, i.e. they decode
    # and contain no control characters other than tab, newline and return

    # Decode bytes to string
    try:
        string = data.decode("iso-8859-15", "strict")
    except UnicodeDecodeError:
        # Bytes that cannot be decoded are not valid
        return False

    # Remove control characters, and compare result against
    # input string
    return removeControlCharacters(string) == string

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)

    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    print(isValidISO885915(bytes)) 


main()

There may be more elegant / Pythonic ways to do this, but it seems to do the trick, and works with both Python 2.7 and 3.3.
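One such alternative, offered as a rough sketch rather than a drop-in replacement: since the check boils down to byte ranges, a regular expression on the raw bytes can do it without decoding at all. The names `CONTROL_BYTES` and `hasNoControlBytes` are made up for this example, and like the code above it assumes that tab, newline and return should be allowed.

```python
import re

# Bytes 00-1f (except tab 09, newline 0a, return 0d) and 7f-9f
CONTROL_BYTES = re.compile(br"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

def hasNoControlBytes(data):
    # True if no disallowed control byte occurs anywhere in data
    return CONTROL_BYTES.search(data) is None

# Byte 0x95 falls in the C1 range, so this is rejected
print(hasNoControlBytes(b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'))

# 0xE9 (e acute) and 0xAD (soft hyphen) are ordinary characters, newline is allowed
print(hasNoControlBytes(b'Jpylyzer d\xe9mo\xad\n'))
```

The first call should print False and the second True. Working on the bytes directly also sidesteps any question of how the codec handles a particular byte.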

Upvotes: 1
