Eli
Eli

Reputation: 38979

Check if String is Valid MySQL UTF8?

I have a MySQL column encoded as utf8. That utf8 is not actually the full utf8 set, but only BMP characters only up to 3 bytes in length. I don't want to try to insert utf8 into MySQL only to find it does not meet MySQL's parameters for what utf8 should be. Is there a way to test in Python if something meets MySQL's parameters before attempting to insert? For obvious reasons, catching exceptions on some_string.encode('utf-8') is not strict enough.

Upvotes: 4

Views: 2257

Answers (2)

user149341
user149341

Reputation:

To check whether a string contains a Unicode character above U+FFFF (and which thus can't be stored in a MySQL table using the "utf8" encoding), you can use the following regular expression:

re.match(u"[^\u0000-\uffff]", s)

Alternatively, if you can upgrade to MySQL 5.5 or later, you may want to consider converting your table to the utf8mb4 character set, which can store all Unicode characters.

Upvotes: 2

Ignacio Vazquez-Abrams
Ignacio Vazquez-Abrams

Reputation: 799082

>>> len(u'\uffff'.encode('utf8')) < 4 # Good; fits in utf8
True
>>> len(u'\U00010000'.encode('utf8')) < 4 # Bad; utf8mb4 only
False
>>> ord(u'\uffff') < 65536 # Good; fits in utf8
True
>>> ord(u'\U00010000') < 65536 # Bad; utf8mb4 only
False

Upvotes: 4

Related Questions