How can I check if a String is encodable in some encoding?

Question

The following test fails on converted Latin1, because illegal characters are replaced with byte with the value 63 (question mark). The problem is that these characters should better cause some exception ...

  @Test
  public void testEncoding() throws UnsupportedEncodingException {
    final String czech = "Řízeček a šampáňo a žízeň";
    // okay
    final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
    // different bytes, but okay
    final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
    // different bytes, but okay
    final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
    // nonsense; Ř,č,... are not in Latin1 code set!!!
    final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");

    System.out.println(Arrays.toString(bytesInLatin2));
    System.out.println(Arrays.toString(bytesInWin1250));
    System.out.println(Arrays.toString(bytesInUtf8));
    System.out.println(Arrays.toString(bytesInLatin1));
    System.out.flush();

    final String latin2 = new String(bytesInLatin2, "ISO8859-2");
    final String win1250 = new String(bytesInWin1250, "Windows-1250");
    final String utf8 = new String(bytesInUtf8, "UTF-8");
    final String latin1 = new String(bytesInLatin1, "ISO8859-1");

    Assert.assertEquals("latin2", czech, latin2);
    Assert.assertEquals("win1250", czech, win1250);
    Assert.assertEquals("utf8", czech, utf8);
    Assert.assertEquals("latin1", czech, latin1); // this test will fail!
  }

There are many situations where the data are finally corrupted because of this behaviour of Java. Is there any library available to validate Strings if they are encodable with some encoding?

Jon Skeet · Accepted Answer

I suspect you're looking for CharsetEncoder.canEncode(CharSequence).

Charset latin2 = Charset.forName("ISO8859-2");
boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
...

How can I check if a String is encodable in some encoding?

Answers (2)

Related Questions