Reputation: 38979
I have a MySQL column encoded as utf8. That utf8 is not actually the full utf8 set, but only BMP characters only up to 3 bytes in length. I don't want to try to insert utf8 into MySQL only to find it does not meet MySQL's parameters for what utf8 should be. Is there a way to test in Python if something meets MySQL's parameters before attempting to insert? For obvious reasons, catching exceptions on some_string.encode('utf-8')
is not strict enough.
Upvotes: 4
Views: 2257
Reputation:
To check whether a string contains a Unicode character above U+FFFF (and which thus can't be stored in a MySQL table using the "utf8" encoding), you can use the following regular expression:
re.match(u"[^\u0000-\uffff]", s)
Alternatively, if you can upgrade to MySQL 5.5 or later, you may want to consider converting your table to the utf8mb4 character set, which can store all Unicode characters.
Upvotes: 2
Reputation: 799082
>>> len(u'\uffff'.encode('utf8')) < 4 # Good; fits in utf8
True
>>> len(u'\U00010000'.encode('utf8')) < 4 # Bad; utf8mb4 only
False
>>> ord(u'\uffff') < 65536 # Good; fits in utf8
True
>>> ord(u'\U00010000') < 65536 # Bad; utf8mb4 only
False
Upvotes: 4