Reputation: 4554

Is the String Constructor from UTF-8 Broken?

I have the following code that loads a null-terminated multi-byte string from a buffer. It first tries to interpret the data as UTF-8 and, if that conversion fails, falls back to interpreting it as ISO-8859-1. Here is the code:

@Override
public String format(String date_format, boolean use_locale, int precision)
{
   String rtn = null;
   int len = 0;
   for(int i = 0; i < max_len; ++i)
   {
      if(storage[storage_offset + i] != 0)
         ++len;
      else
         break;
   }
   try
   {
      rtn = new String(storage, storage_offset, len, "UTF-8");
   }
   catch(UnsupportedEncodingException e1)
   {
      try
      {
         rtn = new String(storage, storage_offset, len, "ISO-8859-1");
      }
      catch(UnsupportedEncodingException e2)
      { }
   }
   return rtn;
}

My intention is that, if the UTF-8 decode fails, the code falls back to ISO-8859-1. This depends on UnsupportedEncodingException being thrown. I have run a test of this code that passes extended characters (codes greater than 128) that do not follow the expected UTF-8 pattern. What I have found is that the exception is NOT thrown; instead, the converted string shows unknown glyphs. My question is whether there has been any change to the standard library implementation that would cause the exception NOT to be thrown?

Upvotes: 0

Answers (3)

Andie2302

Reputation: 4887

You could test whether the charset is available.
To list the available charsets, use:

import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

// Print every charset this JVM supports, keyed by canonical name.
SortedMap<String, Charset> availableCharsets = Charset.availableCharsets();
for (Map.Entry<String, Charset> entrySet : availableCharsets.entrySet()) {
    String key = entrySet.getKey();
    Charset value = entrySet.getValue();
    System.out.println("key: " + key + " value: " + value.name());
}
System.out.println("The default Charset is: " + Charset.defaultCharset().name());
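If you only need a yes/no answer for one name, a shorter check is possible. The following is a minimal sketch; the class name CharsetCheck and the method isAvailable are made up for illustration. Note that UTF-8 and ISO-8859-1 are among the charsets every Java platform is required to support, so for the question's code this check would always succeed.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named only for this example.
class CharsetCheck {
    // Returns true if this JVM can map the given name to a Charset.
    // Charset.isSupported throws IllegalCharsetNameException for
    // syntactically illegal names, so only pass well-formed names.
    static boolean isAvailable(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 supported: " + isAvailable("UTF-8"));
        System.out.println("ISO-8859-1 supported: " + isAvailable("ISO-8859-1"));
        System.out.println("no-such-charset supported: " + isAvailable("no-such-charset"));
    }
}
```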

Upvotes: 0

Buddy

Reputation: 11028

According to the docs for that String constructor, UnsupportedEncodingException is only thrown if the specified charsetName is unknown.

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.

Upvotes: 2

yshavit

Reputation: 43391

The UnsupportedEncodingException is thrown if the charset itself is unsupported (that is, you specify a charset and the system doesn't recognize the name) -- not if the bytes don't encode correctly. Note that the corresponding constructor that takes a java.nio.charset.Charset does not throw that exception (since there's no name to map to a Charset, and thus no possibility that the mapping isn't there).
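To illustrate why the question's test shows unknown glyphs rather than an exception, here is a small sketch using the Charset-taking constructor. That constructor is documented to always replace malformed input with the charset's default replacement string, which for UTF-8 is U+FFFD. The class name ReplacementDemo and the method decodeLenient are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named only for this example.
class ReplacementDemo {
    // Decodes leniently: malformed bytes become U+FFFD, no exception is thrown.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the lone 0xE9 byte is not valid UTF-8.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };
        String s = decodeLenient(latin1);
        // The invalid byte was silently replaced with U+FFFD.
        System.out.println("contains U+FFFD: " + (s.indexOf('\uFFFD') >= 0));
    }
}
```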

The docs for String(byte[], int, int, String) specify the behavior (namely, that it's unspecified :) ) and suggest the fix:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
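As a rough sketch of what that suggestion could look like for the code in the question: configure a CharsetDecoder to report malformed input, and fall back to ISO-8859-1 when it throws CharacterCodingException. The class name StrictDecode, the method name decodeUtf8OrLatin1 and its parameters are assumptions for illustration; they stand in for the asker's storage, storage_offset and max_len fields.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this example.
class StrictDecode {
    static String decodeUtf8OrLatin1(byte[] storage, int offset, int maxLen) {
        // Find the length up to the first null byte, capped at maxLen.
        int len = 0;
        while (len < maxLen && storage[offset + len] != 0) {
            ++len;
        }
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(storage, offset, len)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: ISO-8859-1 maps every byte to a character,
            // so this fallback cannot fail.
            return new String(storage, offset, len, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Using the StandardCharsets constants instead of charset names also removes the need to catch UnsupportedEncodingException at all.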

Upvotes: 2
