Reputation: 1050

Why do I have to encode a utf-8 parameter String to iso-Latin and then decode as utf-8 to get Java utf-8 String?

I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:

String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");

This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.

From the Javadoc, the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array", i.e. I am encoding it as 8859_1 (ISO Latin-1). And from the constructor description, "Constructs a new String by decoding the specified array of bytes using the specified charset", i.e. it decodes the byte array as UTF-8.

Can someone explain to me why this is necessary?

Upvotes: 0

Answers (2)

David

Reputation: 1050

My question was based on a misconception about the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8, the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpServletRequest.getParameter() method. This is not the case.

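HTTP requests do not carry the page's encoding with them. A URI may contain only ASCII characters, so anything else is percent-encoded (this is part of the URI Syntax specification), and unless the request declares a charset the servlet container falls back to ISO-8859-1 when decoding parameters (the Servlet specification's default for POST data). ASCII is a subset of ISO-8859-1, so for plain ASCII text the two give the same result. Thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.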

I had also forgotten that Greek letters such as α are URL-encoded for the request: the browser percent-encodes the UTF-8 bytes of the character, which produces %CE%B1. getParameter() then decodes this as two ISO-8859-1 characters, 0xCE and 0xB1, i.e. Î and ± (I checked this).
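The decoding step described above can be reproduced outside a servlet container. The following is only a sketch: it uses java.net.URLDecoder as a stand-in for whatever the container does internally, and the class name PercentDecodeDemo and the helper decodeAs are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeDemo {

    // Hypothetical helper: percent-decodes the value and interprets the
    // resulting bytes in the named charset.
    static String decodeAs(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%CE%B1"; // the URL-encoded form of the Greek letter alpha

        String asLatin1 = decodeAs(encoded, "ISO-8859-1");
        String asUtf8 = decodeAs(encoded, "UTF-8");

        // Decoded as ISO-8859-1: two characters, U+00CE and U+00B1
        System.out.println(asLatin1.equals("\u00CE\u00B1")); // true
        // Decoded as UTF-8: one character, U+03B1 (alpha)
        System.out.println(asUtf8.equals("\u03B1")); // true
    }
}
```

The comparisons use Unicode escapes so that the result does not depend on the encoding of the source file or the console.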

I now understand why the String needs to be turned back into a byte array and the bytes interpreted as UTF-8. 0xCE does not represent a one-byte character in UTF-8; it is the lead byte of a two-byte sequence, so it is combined with the next byte, 0xB1, and the pair is interpreted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)
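To make the byte values concrete, here is a small self-contained sketch of the whole round trip. It simulates the wrong decoding by hand rather than going through a real request, and the names MojibakeRoundTrip, repair and hex are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {

    // Same idea as the line in the question: recover the original bytes
    // via ISO-8859-1, then decode those bytes as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // Greek small letter alpha
        byte[] wire = alpha.getBytes(StandardCharsets.UTF_8);
        // Simulate the wrong decoding: each byte becomes one character
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(hex(wire)); // CE B1
        System.out.println(hex(misdecoded.getBytes(StandardCharsets.UTF_8))); // C3 8E C2 B1
        System.out.println(repair(misdecoded).equals(alpha)); // true
    }
}
```

The trick is lossless only because ISO-8859-1 maps every byte value to exactly one character, so getBytes("8859_1") hands back the original bytes unchanged. As far as I can tell from the FAQ linked above, calling request.setCharacterEncoding("UTF-8") before the first getParameter() call is the cleaner route for POST data, but I have not verified that for every container.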

Upvotes: 1

Aaronward

Reputation: 119

When decoding, could you not create a class with a decoder method that takes the byte[] as a parameter and returns it as a String? Here is an example that I have used before.

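import java.nio.charset.StandardCharsets;

public class Decoder
{
   // Turns the byte array into a String by decoding it as UTF-8.
   // Without an explicit charset, new String(bytes) would use the
   // platform default encoding, which may not be UTF-8.
   public String decode(byte[] bytes)
   {
      return new String(bytes, StandardCharsets.UTF_8);
   }
}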

Try using this instead of .getBytes(). Hope this works.

Upvotes: 0
