Reputation: 1929
I have the following string -
"\xed\xad\x80\xed\xb1\x93"
When I use this string in a query against a PostgreSQL database, it raises the following error -
DataError: invalid byte sequence for encoding "UTF8": 0xed 0xad 0x80
Decoding it in Python 2.7 (before executing the query) doesn't raise an exception -
Windows test -
'\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\U000e0053'
Linux test -
'\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\udb40\udc53'
In Python 3, decoding it actually raises an exception -
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
How can I check in Python 2.7 that it's not a valid UTF-8 string?
Upvotes: 0
Views: 702
Reputation: 247235
The bytes follow the UTF-8 bit pattern, but they do not encode a character.
The bytes 0xED 0xAD 0x80 decode to the Unicode code point U+DB40, which is a “high surrogate” and not a character as such.
So this data is not UTF-8 encoded text. It makes no sense to encode surrogates in UTF-8; they exist only so that UTF-16 can represent code points above U+FFFF as surrogate pairs.
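To make the arithmetic concrete, here is a small sketch of how the two three-byte sequences from the question map to code points, and how a UTF-16-style pairing of the results gives the value the Windows test printed. The helper name decode_3byte is made up for illustration; it only strips the UTF-8 marker bits and does no validation.

```python
def decode_3byte(b0, b1, b2):
    # Hypothetical helper: a 3-byte UTF-8 sequence is 1110xxxx 10xxxxxx 10xxxxxx.
    # Keep the payload bits (4 + 6 + 6) and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

high = decode_3byte(0xED, 0xAD, 0x80)  # 0xDB40, in the high-surrogate range D800-DBFF
low = decode_3byte(0xED, 0xB1, 0x93)   # 0xDC53, in the low-surrogate range DC00-DFFF

# Combining the two as a UTF-16 surrogate pair yields a single code point.
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(high), hex(low), hex(combined))
```

This matches both outputs in the question: the Linux build kept the two surrogates (u'\udb40\udc53'), while the Windows build apparently merged them into u'\U000e0053'.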
RFC 3629 actually declares that encoding surrogates is not allowed:
The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
So that sounds like a bug in Python 2, and you can report it as such.
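In the meantime, one possible workaround in Python 2.7 is to check the raw bytes yourself before sending them to the database. This is only a sketch, not an official recipe: is_strict_utf8 is a name I made up, and it assumes the input is a byte string (str in Python 2, bytes in Python 3). It relies on the fact that the lead byte 0xED followed by a byte in 0xA0-0xBF always encodes a code point in U+D800-U+DFFF, and that 0xED can never appear as a continuation byte.

```python
import re

# 0xED followed by 0xA0-0xBF is the start of a UTF-8-encoded surrogate.
_ENCODED_SURROGATE = re.compile(b'\xed[\xa0-\xbf]')

def is_strict_utf8(data):
    """Return True if data decodes as UTF-8 and contains no encoded surrogates."""
    try:
        # Catches malformed sequences (and, on Python 3, surrogates as well).
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    # Python 2.7 lets encoded surrogates through, so reject them explicitly.
    return _ENCODED_SURROGATE.search(data) is None

print(is_strict_utf8(b'\xed\xad\x80\xed\xb1\x93'))  # the string from the question
print(is_strict_utf8(b'plain ascii'))
```

The byte scan is used instead of inspecting the decoded text because, as the Windows output in the question shows, some Python 2 builds merge the surrogate pair into one ordinary-looking code point during decoding.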
Upvotes: 1