Meilun Sheng

Reputation: 129

Remove Non-Ansi Chars from a UTF String and Keep Others

We have a Java library that accepts a UTF-8 string as input. If the input contains any non-ANSI character, the library may crash, so we want to remove all non-ANSI characters from the string. How can we do that in Java?

Thanks,

Upvotes: 0

Views: 1147

Answers (2)

Java Devil

Reputation: 10959

Try this. I pulled it from here, so I haven't tested it:

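import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Create an encoder and decoder for the target character encoding
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

// This line is the key to removing "unmappable" characters.
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
String result = inString; // fall back to the original input if conversion fails

try {
    // Encode the string to bytes in a ByteBuffer; unmappable characters are dropped
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));

    // Decode the bytes back into a CharBuffer and then into a string
    CharBuffer cbuf = decoder.decode(bbuf);
    result = cbuf.toString();
} catch (CharacterCodingException cce) {
    // Can still happen for malformed input, for example an unpaired surrogate
    System.err.println("Exception during character encoding/decoding: " + cce.getMessage());
    cce.printStackTrace();
}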

Upvotes: 1

Steven M. Wurster

Reputation: 360

Take a look at String.codePointAt(index). That can give you the Unicode code point for a given character, and from there you could remove those outside your range.
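As a rough sketch of that idea, the loop below walks the string one code point at a time and keeps only those at or below a limit. The class name AsciiFilter, the method keepUpTo, and the choice of 0x7F (7-bit ASCII) as the limit are assumptions made for illustration; pick whatever range matches what your library can actually handle.

```java
public final class AsciiFilter {
    private AsciiFilter() {}

    /** Returns a copy of input with every code point above maxCodePoint removed. */
    public static String keepUpTo(String input, int maxCodePoint) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a surrogate pair into one code point
            int cp = input.codePointAt(i);
            if (cp <= maxCodePoint) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars, depending on the code point
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x7F is assumed here as the upper bound (7-bit ASCII)
        System.out.println(keepUpTo("h\u00E9llo w\u00F6rld", 0x7F)); // prints "hllo wrld"
    }
}
```

Stepping by Character.charCount rather than by 1 means a character outside the Basic Multilingual Plane is dropped as a whole instead of leaving half a surrogate pair behind.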

How you handle the fact that a character has been removed is on your end, but keep in mind that the string you'll be sending to the library isn't necessarily the same as that provided by the client. This may or may not cause problems.

I'm not sure what you mean by ANSI here. Do you mean the Windows-1252 character encoding that people typically call ANSI? That's not ASCII and it's also not ISO-8859-1, so make sure you get your code pages correct.
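To see why the choice of code page matters, here is a minimal sketch that keeps only the characters a given charset can encode. The class name CharsetFilter and the method keepEncodable are hypothetical. The example uses US-ASCII and ISO-8859-1 because every Java platform is required to support them; if you really mean Windows-1252, Charset.forName("windows-1252") works on most JDKs, but that availability is an assumption you should verify on yours.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public final class CharsetFilter {
    private CharsetFilter() {}

    /** Returns a copy of input containing only code points that target can encode. */
    public static String keepEncodable(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int len = Character.charCount(cp);
            // Ask the encoder about the whole code point (1 or 2 chars)
            if (encoder.canEncode(input.subSequence(i, i + len))) {
                out.appendCodePoint(cp);
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "caf\u00E9 \u20AC5"; // "café €5"
        // U+00E9 and U+20AC are both outside US-ASCII
        System.out.println(keepEncodable(s, StandardCharsets.US_ASCII));   // prints "caf 5"
        // U+00E9 is in ISO-8859-1, U+20AC is not
        System.out.println(keepEncodable(s, StandardCharsets.ISO_8859_1)); // prints "café 5"
    }
}
```

The same input gives different results depending on the target charset, which is why it is worth pinning down what "ANSI" means for your library before filtering.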

Upvotes: 0
