Reputation: 7141
I have a MySQL table that uses latin1, and unfortunately I cannot change this.
Before inserting Strings into this table I would like to check if a String contains a character that is not part of the latin1 character set. This way I can remove it from my data set.
How can I do this?
e.g.
boolean hasNonLatin1Chars = string.chars()
        .anyMatch(c -> ...)
Upvotes: 2
Views: 1622
Reputation: 142383
If your source data is consistently UTF8, then say so. Then you get the best of both worlds -- UTF8 characters that have a transliteration to latin1 will be changed; those that don't will come out as '?'.
Use this in the getConnection() call:
?useUnicode=yes&characterEncoding=UTF-8
No testing for bad characters, no conversion in your code. MySQL does all the work automagically.
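As a rough sketch of where those parameters go, the snippet below builds a JDBC URL and passes it to DriverManager.getConnection(). The host, port and schema name are hypothetical placeholders, and actually opening the connection assumes MySQL Connector/J is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class Latin1Connection {

    /** Builds a JDBC URL that declares the client side as UTF-8. */
    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    /** Opens a connection; "localhost", 3306 and "mydb" are placeholders. */
    static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl("localhost", 3306, "mydb"), user, password);
    }
}
```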
Upvotes: 0
Reputation: 75986
To keep it simple and robust, take advantage of CharsetEncoder:
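import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** replaces any character that is invalid in Latin1 with the character rep */
public static String latin1(String str, char rep) {
    CharsetEncoder cs = StandardCharsets.ISO_8859_1.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith(new byte[] { (byte) rep });
    try {
        ByteBuffer b = cs.encode(CharBuffer.wrap(str));
        // decode only the bytes actually written: b.array() can hold unused
        // trailing bytes, e.g. when a surrogate pair becomes one replacement
        return StandardCharsets.ISO_8859_1.decode(b).toString();
    } catch (CharacterCodingException e) {
        throw new RuntimeException(e); // should not happen
    }
}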
This will replace each character that cannot be encoded in ISO_8859_1 (= Latin1) with the replacement character rep (which, of course, must itself be a valid Latin1 char).
If you are OK with the default replacement ('?'), you can make it simpler:
public static String latin1(String str) {
    return new String(str.getBytes(StandardCharsets.ISO_8859_1),
            StandardCharsets.ISO_8859_1);
}
For example:
public static void main(String[] args) {
    String x = "hi Œmar!";
    System.out.println("'" + x + "' -> '" + latin1(x, '?') + "'");
}
outputs 'hi Œmar!' -> 'hi ?mar!'
A possible drawback of this approach is that it only lets you replace each invalid character with a single replacement character: you cannot remove it or substitute a multi-character sequence.
If you want that, and you are reasonably sure that some character will never appear in your string, you can resort to the usual dirty tricks. For example, assuming \u0000 never appears:
/* removes invalid Latin1 characters - assumes the zero character never appears */
public static String latin1removeinvalid(String str) {
    return latin1(str, (char) 0).replace("\u0000", "");
}
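If relying on a sentinel character feels too fragile, one alternative is to walk the string by code point and keep only values up to U+00FF, which is the range ISO-8859-1 can encode. This is a minimal sketch rather than part of the CharsetEncoder approach above; the class and method names are illustrative only.

```java
final class Latin1Replace {

    /**
     * Replaces every code point above U+00FF with the given string.
     * Pass "" to remove such characters, or e.g. "OE" for a multi-character substitute.
     */
    static String replaceNonLatin1(String str, String replacement) {
        StringBuilder sb = new StringBuilder(str.length());
        str.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                sb.append((char) cp);   // encodable in ISO-8859-1, keep as is
            } else {
                sb.append(replacement); // one replacement per code point
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0152 is the OE ligature, which lies outside ISO-8859-1
        System.out.println(replaceNonLatin1("hi \u0152mar!", ""));
    }
}
```

Because it iterates code points rather than chars, a supplementary character (a surrogate pair) yields a single replacement instead of two.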
Added: if you only want to check for validity, then it's simpler:
public static boolean isValidLatin1(String str) {
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(str);
}
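Since the stated goal is to drop offending strings from a data set, the same canEncode check can serve as a filter. The sketch below assumes the data is a List<String>; the class and method names are made up for illustration. A CharsetEncoder is not safe for concurrent use, so it is created per call and used from a sequential stream.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class Latin1Filter {

    /** Returns only the rows whose characters can all be encoded as ISO-8859-1. */
    static List<String> keepEncodable(List<String> rows) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return rows.stream()
                .filter(s -> encoder.canEncode(s))
                .collect(Collectors.toList());
    }
}
```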
Upvotes: 2
Reputation: 4266
The Basic Latin range is 0020–007F, so you could check whether removing the first character outside that range changes the String:
boolean hasNonLatin1Chars = !string.equals(string.replaceFirst("[^\\u0020-\\u007F]", ""));
This will be true if the String contains a character outside Basic Latin.
Latin-1 itself also includes the Latin-1 Supplement (00A0–00FF). Latin Extended-A (0100–017F) and Latin Extended-B (0180–024F) lie outside Latin-1, so widen the range only as far as your target character set allows.
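The range idea also fits the chars().anyMatch(...) shape from the question. Here is a minimal sketch, assuming the target is ISO-8859-1, which covers exactly U+0000 to U+00FF; the class name is illustrative.

```java
final class Latin1Check {

    /** True if any UTF-16 code unit lies above U+00FF, i.e. outside ISO-8859-1. */
    static boolean hasNonLatin1Chars(String string) {
        // surrogate code units are also above 0xFF, so supplementary characters are caught
        return string.chars().anyMatch(c -> c > 0xFF);
    }
}
```

Unlike the regex with a lower bound of 0020, this does not flag control characters such as tab or newline.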
Upvotes: -1