dmatej

Reputation: 1596

How can I check if a String is encodable in some encoding?

The following test fails for Latin-1 (ISO 8859-1), because characters that cannot be encoded are silently replaced with a byte of value 63 (a question mark). I would much prefer these characters to cause an exception.

  @Test
  public void testEncoding() throws UnsupportedEncodingException {
    final String czech = "Řízeček a šampáňo a žízeň";
    // okay
    final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
    // different bytes, but okay
    final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
    // different bytes, but okay
    final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
    // lossy: Ř, č, ... are not in the Latin-1 character set and are replaced by '?'
    final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");

    System.out.println(Arrays.toString(bytesInLatin2));
    System.out.println(Arrays.toString(bytesInWin1250));
    System.out.println(Arrays.toString(bytesInUtf8));
    System.out.println(Arrays.toString(bytesInLatin1));
    System.out.flush();

    final String latin2 = new String(bytesInLatin2, "ISO8859-2");
    final String win1250 = new String(bytesInWin1250, "Windows-1250");
    final String utf8 = new String(bytesInUtf8, "UTF-8");
    final String latin1 = new String(bytesInLatin1, "ISO8859-1");

    Assert.assertEquals("latin2", czech, latin2);
    Assert.assertEquals("win1250", czech, win1250);
    Assert.assertEquals("utf8", czech, utf8);
    Assert.assertEquals("latin1", czech, latin1); // this test will fail!
  }

This behaviour of Java silently corrupts data in many situations. Is there a library that can check whether a String is encodable in a given encoding?

Upvotes: 6

Views: 4644

Answers (2)

James Holderness

Reputation: 23011

As an alternative to Jon Skeet's suggestion, you can also use the CharsetEncoder class to do the encoding directly (with the encode method), but first call the onMalformedInput and onUnmappableCharacter methods to specify what the encoder should do when it encounters bad input. Passing CodingErrorAction.REPORT tells it to fail rather than substitute a replacement byte.

That way most of the time you're just doing a simple encode call, but if anything goes wrong you'll get an exception.
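
A minimal sketch of that approach might look like the following. The class name StrictEncoding and the helper encodeStrict are made up for illustration, and the sample is the first word of the string from the question, written with Unicode escapes so the source compiles regardless of the file encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncoding {

    // Hypothetical helper: encodes the text, throwing instead of substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        // Copy only the bytes that were actually written.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // "Rizecek" with Czech diacritics, as in the question.
        String sample = "\u0158\u00edze\u010dek";
        try {
            byte[] utf8 = encodeStrict(sample, StandardCharsets.UTF_8);
            System.out.println("UTF-8 ok, " + utf8.length + " bytes");
            // Latin-1 has no mapping for U+0158, so this call is expected to throw.
            encodeStrict(sample, StandardCharsets.ISO_8859_1);
            System.out.println("Latin-1 ok");
        } catch (CharacterCodingException e) {
            System.out.println("Not encodable: " + e);
        }
    }
}
```

The failure surfaces as a checked CharacterCodingException, so callers have to decide explicitly what to do with text that does not fit the target charset.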

Upvotes: 1

Jon Skeet

Reputation: 1502806

I suspect you're looking for CharsetEncoder.canEncode(CharSequence).

Charset latin2 = Charset.forName("ISO8859-2");
boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
...
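
To try this outside the test from the question, a self-contained sketch could look like the one below. The class EncodableCheck and both helper methods are hypothetical names, not part of the JDK. The second helper is an extra convenience that reports where the first unencodable character sits, which can help when tracking down the offending input.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncodableCheck {

    // Hypothetical helper: true if every character of the text maps into the charset.
    static boolean isEncodable(String text, Charset charset) {
        // Encoders are stateful and not thread-safe, so create one per call.
        return charset.newEncoder().canEncode(text);
    }

    // Hypothetical helper: index of the first char that cannot be encoded, or -1.
    static int firstUnencodableIndex(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        int i = 0;
        while (i < text.length()) {
            // Step by code point so surrogate pairs are checked as a unit.
            int length = Character.charCount(text.codePointAt(i));
            if (!encoder.canEncode(text.subSequence(i, i + length))) {
                return i;
            }
            i += length;
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Rizecek" with Czech diacritics, written with Unicode escapes so the
        // source compiles regardless of the file encoding.
        String sample = "\u0158\u00edze\u010dek";
        System.out.println(isEncodable(sample, StandardCharsets.UTF_8));
        System.out.println(isEncodable(sample, StandardCharsets.ISO_8859_1));
        System.out.println(firstUnencodableIndex(sample, StandardCharsets.ISO_8859_1));
    }
}
```

Checking first and encoding afterwards walks the text twice, so this fits validation at an input boundary better than a hot encoding path.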

Upvotes: 11
