priya
priya

Reputation: 26749

How do I identify if the string contains a special character which cannot be stored using a utf8-mb4 character set

Refer to this tweet and the following thread were we are trying to store a similar tweet into the database. I am unable to store this tweet in MySQL, I would like to know how to identify, if the string contains a character which cannot be processed by the utf8-mb4 character set, so that I can avoid storing it.

Upvotes: 10

Views: 2665

Answers (3)

Unai Vivi
Unai Vivi

Reputation: 3111

If you want to avoid storing troublesome characters (the rare fancy characters outside the Basic Multilingual Plane, that give you problems), you can parse the String's characters and discard the String if it contains codepoints for which Character.charCount returns 2, or for which Character.isSupplementaryCodePoint returns true.

This way, as you asked, you can avoid storing those strings that (for some reason) your DBMS has trouble with.

Sources: see javadoc for

  • Character.charCount
  • Character.isSupplementaryCodePoint

and, while you're at it

  • String.codePointAt
  • String.codePointCount

Upvotes: 0

Bhavik Ambani
Bhavik Ambani

Reputation: 6657

Instead of finding the special character of the string you can do one thing you can convert the string into Hex format and then back you can convert that into previous string

public static synchronized String toHex(byte [] buf){
    StringBuffer strbuf = new StringBuffer(buf.length * 2);
    int i;
    for (i = 0; i < buf.length; i++) {
        if (((int) buf[i] & 0xff) < 0x10){
            strbuf.append("0");
        }
        strbuf.append(Long.toString((int) buf[i] & 0xff, 16));
    }
    return strbuf.toString();
}

By using the below function you can convert back to original string

public synchronized static byte[] hexToBytes(String hexString) {
    HexBinaryAdapter adapter = new HexBinaryAdapter();
    byte[] bytes = adapter.unmarshal(hexString);
    return bytes;
}

Upvotes: 1

Tassos Bassoukos
Tassos Bassoukos

Reputation: 16152

The character that poses a problem for you is U+1F603 SMILING FACE WITH OPEN MOUTH, which has a value not representable in 16 bits. When converted to UTF-8 the byte values are f0 9f 98 83, which should fit without issues in a utf8mb4 character set MySQL column, so I will agree with the other commenters that it doesn't look to be a MySQL issue. If you can attempt to re-insert this tweet, log all SQL statements as received by MySQL to determine if the characters get corrupted before or after sending them to MySQL.

Upvotes: 4

Related Questions