Drxxd

Reputation: 1929

How to validate that a string is valid UTF-8 in Python 2.7

I have the following string -

"\xed\xad\x80\xed\xb1\x93"

When I use this string in a query against a PostgreSQL database, it raises the following error -

DataError: invalid byte sequence for encoding "UTF8": 0xed 0xad 0x80

When I test it in Python 2.7 (before executing the query), decoding doesn't raise an exception -

Windows test -

>>> '\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\U000e0053'

Linux test -

>>> '\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\udb40\udc53'

In Python 3, decoding actually raises an exception -

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

How can I check in Python 2.7 that it's not a valid UTF-8 string?

Upvotes: 0

Views: 702

Answers (1)

Laurenz Albe

Reputation: 247235

The bytes follow the UTF-8 bit layout, but they do not encode a character.

0xED 0xAD 0x80 decodes to Unicode code point U+DB40, which is a “high surrogate” and not a character as such.
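To make the arithmetic concrete, here is a small sketch that unpacks the payload bits of the two three-byte sequences by hand. The helper names `decode_three_byte` and `combine_surrogates` are made up for this illustration and are not part of any library. It also suggests why the two Python 2.7 tests in the question print different results: one shows the two surrogates separately, the other shows them merged into a single code point.

```python
def decode_three_byte(b0, b1, b2):
    # Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx.
    # Keep only the payload bits (x) and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high, low):
    # UTF-16 rule for turning a surrogate pair into one code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high = decode_three_byte(0xED, 0xAD, 0x80)
low = decode_three_byte(0xED, 0xB1, 0x93)

print(hex(high))                           # 0xdb40, a high surrogate
print(hex(low))                            # 0xdc53, a low surrogate
print(hex(combine_surrogates(high, low)))  # 0xe0053
```

The values 0xDB40 and 0xDC53 match the `u'\udb40\udc53'` result from the Linux test, and 0xE0053 matches the `u'\U000e0053'` result from the Windows test.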

So this data does not consist of UTF-8 encoded characters. It makes no sense to encode surrogates in UTF-8; they exist for UTF-16, where a high and a low surrogate together represent one code point above U+FFFF.

RFC 3629 actually declares that encoding surrogates is not allowed:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

So this looks like a bug in Python 2, and you can report it as such.
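In the meantime, one possible workaround is to reject such input yourself before sending it to the database. The following is only a sketch under stated assumptions: `is_strict_utf8` is a hypothetical helper name, and the input is assumed to be a byte string (`str` in Python 2, `bytes` in Python 3). It relies on the fact that in well-formed UTF-8 the lead byte 0xED may only be followed by 0x80 to 0x9F, so 0xED followed by 0xA0 to 0xBF always marks an encoded surrogate. The check is done on the raw bytes rather than on the decoded result, because the Windows test in the question shows that the decoded string can come back as an ordinary-looking code point.

```python
import re

# 0xED followed by 0xA0..0xBF and one continuation byte encodes U+D800..U+DFFF.
_ENCODED_SURROGATE = re.compile(b"\xed[\xa0-\xbf][\x80-\xbf]")


def is_strict_utf8(data):
    """Return True if `data` (a byte string) is UTF-8 without encoded surrogates."""
    try:
        # Let the codec reject everything it already considers malformed.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # After a successful decode, 0xED can only appear as a lead byte,
    # so a match here is always an encoded surrogate.
    return _ENCODED_SURROGATE.search(data) is None


print(is_strict_utf8(b"\xed\xad\x80\xed\xb1\x93"))  # False
print(is_strict_utf8(b"caf\xc3\xa9"))               # True
print(is_strict_utf8(b"\xed\x9f\xbf"))              # True (U+D7FF, just below the surrogate range)
```

On Python 3 the `decode` call already fails for the string from the question, so the regular expression only matters on Python 2.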

Upvotes: 1
