Reputation: 1554
In my database I get the error
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column
I use Java and MySQL 5. As far as I know, 4-byte UTF-8 characters are legal in Java but illegal in MySQL 5. I think that may be causing my problem, and I want to check my data, so here's my question: how can I check whether my UTF-8 data contains only characters of up to 3 bytes, or also 4-byte characters?
Upvotes: 8
Views: 12855
Reputation: 140236
If you do not need to support characters beyond the BMP, you can just strip them before handing the string to MySQL:
    public static String withNonBmpStripped( String input ) {
        if( input == null ) throw new IllegalArgumentException("input");
        return input.replaceAll("[^\\u0000-\\uFFFF]", "");
    }
If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars and thus needs special handling in many operations.
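To illustrate the "special handling" point, here is a minimal sketch of truncating a string by code points instead of by chars, so a surrogate pair is never cut in half. The class and method names are hypothetical, made up for this example; only the `String` and `Character` calls are standard JDK API.

```java
// Hypothetical helper; the names are illustrative, not from any library.
final class CodePointTruncation {

    // Counts Unicode code points, whereas String.length() counts UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Keeps at most maxCodePoints code points without splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (codePointLength(s) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index.
        int endIndex = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, endIndex);
    }
}
```

For a string such as "a", U+1F600, "b", `length()` reports 4 while `codePointLength` reports 3, which is the kind of mismatch that makes plain `substring` calls risky.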
Upvotes: 10
Reputation: 626
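The best approach I have found for getting rid of non-BMP characters in Java is to replace each of them with the replacement character U+FFFD, so it stays visible that something was removed:
    inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");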
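A small sketch comparing the two regex-based options from this thread, removal versus substitution with U+FFFD. It relies on `java.util.regex` matching a valid surrogate pair as a single code point, so each non-BMP character yields exactly one marker. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class NonBmpFilter {

    // Matches any code point outside U+0000..U+FFFF; compiled once for reuse.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Removes non-BMP characters entirely.
    static String strip(String input) {
        return NON_BMP.matcher(input).replaceAll("");
    }

    // Replaces each non-BMP character with a single U+FFFD marker.
    static String replaceWithMarker(String input) {
        return NON_BMP.matcher(input).replaceAll("\uFFFD");
    }
}
```

Either result contains only BMP characters, so it should fit a MySQL utf8 (3-byte) column as far as character range is concerned.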
Upvotes: 6
Reputation: 1502786
UTF-8 encodes everything in the basic multilingual plane (i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes. Therefore, you just need to check whether everything in your string is in the BMP.
In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, as Java uses surrogate pairs to encode non-BMP characters:
    public static boolean isEntirelyInBasicMultilingualPlane(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isSurrogate(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
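If you want the actual byte width rather than a yes/no answer, a sketch along these lines maps each code point to its UTF-8 length using the standard encoding ranges. It assumes Java 8+ for `String.codePoints()`; the class and method names are hypothetical.

```java
// Hypothetical helper; the names are illustrative only.
final class Utf8Width {

    // UTF-8 byte count for one code point, following the standard ranges.
    // Caveat: an unpaired surrogate is reported as 3 here, although Java's
    // encoder would actually emit a substitution byte for it.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;     // up to U+07FF
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    // Widest UTF-8 sequence needed by any character in the text (0 if empty).
    static int maxUtf8Width(String text) {
        return text.codePoints().map(Utf8Width::utf8Length).max().orElse(0);
    }
}
```

A result of 4 from `maxUtf8Width` means the string contains at least one character that a 3-byte utf8 column in MySQL cannot store.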
Upvotes: 18