C C
C C

Reputation: 449

Is it ok to utf8_decode entire SQL statement to avoid utf-8 / utf8mb4 issues?

I have a site that accepts form-based inputs from users and stores into a mysql(i) database. 99% of users are from the USA.

If a user is on a mobile device and decides it would be fun to use an emoji in one of the form fields, the resulting SQL causes a failure when the UPDATE or INSERT is attempted. This is due to the emoji being outside utf-8 and therefore a mis-match of encoding between the UX and the DB. I believe I understand the underlying issue.

I'm on mysql 5.5 so I could actually support utf8mb4 char set -- but it'll be some time before I can get around to that.

In the meantime, would it be ok/safe to just do this (in PHP):

$sql = "INSERT INTO my_table ... *long insert statement across two dozen rows*";
$sql = utf8_decode($sql);
$db->sql_query($sql);

So - simply force the entire SQL statement to utf-8? I realize emoji characters and some others may be converted to "?" -- but as long as the statement itself isn't broken or I don't lose normal text inputs from the users, I'll be ok with this.

Any issues with this idea as a temp fix until I can convert the table's schemas to work with utf8mb4?

EDIT: Error being returned from the database looks like this:

"Incorrect string value: '\xF0\x9F\x99\x8A...' for column..."

Upvotes: 3

Views: 1281

Answers (1)

deceze
deceze

Reputation: 522125

Some facts:

  • many emoji are above the BMP (basic multilingual plane) in Unicode, i.e. above code point U+FFFF
  • MySQL's utf8 charset can only represent the BMP, it is not actually full UTF-8 (yay MySQL)
  • for full Unicode support MySQL has utf8mb4
  • there should be no actual error when trying to store characters above the BMP in a utf8 charset column, MySQL will simply discard the unsupported characters (yay silent data corruption)
  • utf8_decode does not "force everything to UTF-8", rather it converts a string from UTF-8 encoding to ISO-8859-1 ("Latin-1") encoding; this will discard many more characters than just emoji

So it looks to me like you have some other underlying problem if you get an actual error thrown at some point. Maybe you think you're talking to your database in UTF-8 when you're actually not. You need to set your connection charset via mysqli_set_charset.

"Filtering out" characters your database does not support should already happen, MySQL will simply discard those characters. If you want to do this manually in PHP, you could do this to filter out all characters above U+FFFF:

$string = preg_replace_callback('/./u', function (array $m) {
    return strlen($m[0]) > 4 ? null : $m[0];
}, $string);

Overall: supporting utf8mb4 takes about two minutes. You just need to set your table/column charset to utf8mb4 and do the same with your mysqli connection. If you're then sending actual UTF-8 data from your PHP app to your database, you'll be storing emoji just fine.

Upvotes: 3

Related Questions