Reputation: 824
I want to test if a string of bytes that I'm extracting from a file results in valid ISO-8859-15 encoded text. The first thing I came across is this similar case about UTF-8 validation:
https://stackoverflow.com/a/5259160/1209004
So based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:
#! /usr/bin/env python
#
def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    try:
        bytes.decode('iso-8859-15', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
    isValidLatin = isValidISO885915(bytes)
    print(isValidLatin)

main()
However, running this returns True, even though byte 0x95 is not a valid code point in ISO-8859-15! Am I overlooking something really obvious here? (BTW, I tried this with Python 2.7.4 and 3.3; the results are identical in both cases.)
Upvotes: 2
Views: 1885
Reputation: 824
I think I've found a workable solution myself, so I might as well share it.
Looking at the code page layout of ISO 8859-15 (see here), I really only need to check for the presence of code points 00-1f and 7f-9f. These correspond to the C0 and C1 control codes.
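To illustrate why a strict decode never fails here: as far as I can tell, Python's iso-8859-15 codec maps all 256 byte values, with the control ranges landing on Unicode control characters (category Cc). A minimal sketch of that check, written for Python 3 (the variable names are just for illustration):

```python
import unicodedata

# Decode every possible byte value; a strict decode would raise
# UnicodeDecodeError on the first byte the codec cannot map.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-15", "strict")

# Collect the byte values whose decoded character is a control character
control_bytes = [b for b, ch in zip(all_bytes, decoded)
                 if unicodedata.category(ch) == "Cc"]

# Number of characters produced by the decode
print(len(decoded))
# Are the control bytes exactly 0x00-0x1f plus 0x7f-0x9f?
print(control_bytes == list(range(0x00, 0x20)) + list(range(0x7f, 0xa0)))
```

So byte x95 decodes without complaint to the C1 control character U+0095, which is why the decode-only test reports True.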
In my project I was already using something based on the code here for removing control characters from a string (C0 + C1). So, using that as a basis I came up with this:
#! /usr/bin/env python
#
import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string
    # Based on: https://stackoverflow.com/a/19016117/1209004
    # Tab, newline and return are part of C0, but are allowed in XML
    allowedChars = [u'\t', u'\n', u'\r']
    # Category "Cc" covers exactly the C0 and C1 controls; testing only the
    # first letter "C" would also strip the soft hyphen (byte xAD, category Cf)
    return "".join(ch for ch in string if
                   unicodedata.category(ch) != "Cc" or ch in allowedChars)

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    # Decode bytes to string
    try:
        string = bytes.decode("iso-8859-15", "strict")
    except UnicodeDecodeError:
        # A decode error means the bytes are not valid
        return False
    # Remove control characters, and compare result against
    # input string
    return removeControlCharacters(string) == string

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
    print(isValidISO885915(bytes))

main()
There may be more elegant / Pythonic ways to do this, but it seems to do the trick, and works with both Python 2.7 and 3.3.
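One possible alternative, sketched here under the assumption that checking the raw bytes is acceptable: since the codec maps every byte value, the decode step can be skipped and the control ranges tested directly with a bytes regular expression. The names CONTROL_BYTES and has_no_control_bytes are hypothetical, not taken from any library:

```python
import re

# C0 range 0x00-0x1f minus tab (0x09), newline (0x0a) and return (0x0d),
# plus DEL and the C1 range 0x7f-0x9f
CONTROL_BYTES = re.compile(br"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

def has_no_control_bytes(data):
    """Return True if data contains no C0/C1 control bytes (tab, LF, CR allowed)."""
    return CONTROL_BYTES.search(data) is None

# Same test bytes as above, containing x95
print(has_no_control_bytes(b"Jpylyzer d\x95mo\xff"))      # False
# Without x95; the trailing return and newline are allowed
print(has_no_control_bytes(b"Jpylyzer demo\xff\r\n"))     # True
```

This avoids building an intermediate string, at the cost of hard-coding the byte ranges instead of relying on unicodedata.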
Upvotes: 1