Eduardo

Reputation: 7141

Filtering non MySQL Latin1 Characters from a String in Java

I have a MySQL table that uses latin1. Unfortunately, I cannot change this.

Before inserting Strings into this table I would like to check if a String contains a character that is not part of the latin1 character set. This way I can remove it from my data set.

How can I do this?

For example:

boolean hasNonLatin1Chars = string.chars()
                .anyMatch(c -> ...)

Upvotes: 2

Views: 1622

Answers (3)

Rick James

Reputation: 142383

If your source data is consistently UTF-8, then tell MySQL so. You then get the best of both worlds: UTF-8 characters that have a latin1 equivalent will be converted, and those that don't will come out as '?'.

Append this to the JDBC URL passed to getConnection():

?useUnicode=yes&characterEncoding=UTF-8

No testing for bad characters, no conversion in your code. MySQL does all the work automagically.
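For illustration, the parameters might be attached to a JDBC URL as sketched below. The class name, host, port, and schema name are hypothetical, and the sketch only builds the URL string; opening the connection needs the MySQL driver on the classpath.

```java
final class Latin1ConnectionUrl {
    /** Builds a MySQL JDBC URL that declares the client-side data as UTF-8. */
    static String jdbcUrl(String hostAndPort, String database) {
        return "jdbc:mysql://" + hostAndPort + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // "localhost:3306" and "mydb" are placeholder values, not taken from the question.
        String url = jdbcUrl("localhost:3306", "mydb");
        // The URL would then be passed to DriverManager.getConnection(url, user, password).
        System.out.println(url);
    }
}
```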

Upvotes: 0

leonbloy

Reputation: 75986

To keep it simple and robust, take advantage of CharsetEncoder:

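import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Replaces every character that Latin1 cannot encode with the character rep. */
public static String latin1(String str, char rep) {
    CharsetEncoder cs = StandardCharsets.ISO_8859_1.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith(new byte[] { (byte) rep });
    try {
        ByteBuffer b = cs.encode(CharBuffer.wrap(str));
        // Decode only the bytes actually written; b.array() can be longer than the encoded data.
        return StandardCharsets.ISO_8859_1.decode(b).toString();
    } catch (CharacterCodingException e) {
        throw new RuntimeException(e); // should not happen
    }
}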

This will replace each character that is invalid in ISO_8859_1 (= Latin1) with the replacement character rep (which, of course, should itself be a valid Latin1 character).

If you are OK with the default replacement ('?'), you can make it simpler:

public static String latin1(String str) {
    return new String(str.getBytes(StandardCharsets.ISO_8859_1),
          StandardCharsets.ISO_8859_1);
}

For example:

public static void main(String[] args)  {
    String x = "hi Œmar!";
    System.out.println("'" + x + "' -> '" + latin1(x,'?') + "'");
}

outputs 'hi Œmar!' -> 'hi ?mar!'

A possible drawback of this approach is that it only allows you to replace each invalid character with a single replacement character: you cannot remove it or use a multi-character sequence. If you want this, and if you are reasonably sure that some character will never appear in your string, you can go for the usual dirty tricks. For example, assuming that \u0000 will never appear:

/* removes invalid Latin1 characters - assumes the zero character never appears */
public static String latin1removeinvalid(String str) {
    return latin1(str,(char)0).replace("\u0000", "");
}
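If no sentinel character can be ruled out, the invalid characters can be dropped directly instead. The following is a minimal sketch, not part of the original answer; it relies on the fact that ISO-8859-1 covers exactly the code points U+0000 to U+00FF, and the class and method names are hypothetical.

```java
final class Latin1Filter {
    /** Drops every code point above U+00FF, the highest one ISO-8859-1 can encode. */
    static String removeNonLatin1(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        str.codePoints()
           .filter(cp -> cp <= 0xFF)          // keep only Latin1 code points
           .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape in the literal is the OE ligature used in the example above.
        System.out.println(removeNonLatin1("hi \u0152mar!")); // prints "hi mar!"
    }
}
```

Unlike the sentinel trick, this keeps any zero characters that were already present in the input.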

Added: if you only want to check for validity, then it's simpler:

public static boolean isValidLatin1(String str) {
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(str);
}
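One caveat: MySQL documents its latin1 character set as cp1252 (West European), which is a superset of ISO-8859-1 in the 0x80 to 0x9F range, so characters such as Œ and € can be stored in a latin1 column even though ISO_8859_1 rejects them. A hedged sketch of the same validity check against Java's windows-1252 charset follows; it assumes that charset is close enough to MySQL's latin1 for validation (the two may differ on the few byte values cp1252 leaves undefined), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class MySqlLatin1Check {
    // Assumption: Java's windows-1252 approximates MySQL's latin1 well enough for this check.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static boolean isValidCp1252(String str) {
        // A new encoder per call, because CharsetEncoder instances are not thread-safe.
        return CP1252.newEncoder().canEncode(str);
    }

    public static void main(String[] args) {
        System.out.println(isValidCp1252("hi \u0152mar!")); // OE ligature: encodable in cp1252
        System.out.println(isValidCp1252("\u4e2d"));        // CJK character: not encodable
    }
}
```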

Upvotes: 2

achAmháin

Reputation: 4266

The Basic Latin block is 0000–007F (its printable characters are 0020–007E), so you could check whether removing the first character outside that range changes the String:

boolean hasNonLatin1Chars = !string.equals(string.replaceFirst("[^\\u0020-\\u007F]", ""));

This will be true if the String contains a character outside the range, and false otherwise.

Note that this range covers only ASCII. Latin1 also includes the Latin-1 Supplement (00A0–00FF), so widen the range to \u0000-\u00FF to test for Latin1 itself. Latin Extended-A (0100–017F) and Latin Extended-B (0180–024F) are not part of Latin1; include them only if your target character set supports them.
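The range test can also be written with the stream from the question. This is a minimal sketch, assuming the target is the full Latin1 range 0000–00FF (Basic Latin plus Latin-1 Supplement, including control characters); the class name is hypothetical.

```java
final class Latin1Check {
    /** True if any UTF-16 code unit lies above U+00FF, the top of the Latin1 range. */
    static boolean hasNonLatin1Chars(String string) {
        return string.chars().anyMatch(c -> c > 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(hasNonLatin1Chars("hi \u0152mar!")); // prints "true"
        System.out.println(hasNonLatin1Chars("caf\u00e9"));     // prints "false"
    }
}
```

Because surrogate code units are all above U+00FF, characters outside the Basic Multilingual Plane are flagged as well.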

Upvotes: -1
