Reputation: 26749
Refer to this tweet and the following thread were we are trying to store a similar tweet into the database. I am unable to store this tweet in MySQL, I would like to know how to identify, if the string contains a character which cannot be processed by the utf8-mb4 character set, so that I can avoid storing it.
Upvotes: 10
Views: 2665
Reputation: 3111
If you want to avoid storing troublesome characters (the rare fancy characters outside the Basic Multilingual Plane, that give you problems), you can parse the String
's characters and discard the String
if it contains codepoints for which Character.charCount
returns 2
, or for which Character.isSupplementaryCodePoint
returns true
.
This way, as you asked, you can avoid storing those strings that (for some reason) your DBMS has trouble with.
Sources: see javadoc for
Character.charCount
Character.isSupplementaryCodePoint
and, while you're at it
String.codePointAt
String.codePointCount
Upvotes: 0
Reputation: 6657
Instead of finding the special character of the string you can do one thing you can convert the string into Hex format and then back you can convert that into previous string
public static synchronized String toHex(byte [] buf){
StringBuffer strbuf = new StringBuffer(buf.length * 2);
int i;
for (i = 0; i < buf.length; i++) {
if (((int) buf[i] & 0xff) < 0x10){
strbuf.append("0");
}
strbuf.append(Long.toString((int) buf[i] & 0xff, 16));
}
return strbuf.toString();
}
By using the below function you can convert back to original string
public synchronized static byte[] hexToBytes(String hexString) {
HexBinaryAdapter adapter = new HexBinaryAdapter();
byte[] bytes = adapter.unmarshal(hexString);
return bytes;
}
Upvotes: 1
Reputation: 16152
The character that poses a problem for you is U+1F603 SMILING FACE WITH OPEN MOUTH
, which has a value not representable in 16 bits. When converted to UTF-8 the byte values are f0 9f 98 83
, which should fit without issues in a utf8mb4
character set MySQL column, so I will agree with the other commenters that it doesn't look to be a MySQL issue. If you can attempt to re-insert this tweet, log all SQL statements as received by MySQL to determine if the characters get corrupted before or after sending them to MySQL.
Upvotes: 4