Reputation: 17923

Character encoding issues?

We had a CLOB column in the DB. Now when we extract this CLOB and try to display it (as plain text, not HTML), some characters show up as junk on the HTML screen. When streamed directly to a file, the character looks like ” (not the usual double quote on a regular keyboard).

One more observation:

System.out.println("”".getBytes()[0]);

prints -108.

Why would a character's byte be in the negative range? Is there any way to display it correctly on an HTML screen?

Upvotes: 2

Answers (3)

Mike Deck

Reputation: 18397

To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ASCII characters.

To set the encoding you have two options.

1. Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.

Keep in mind that option 1 takes precedence over option 2, i.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.

The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.

Generally, if you are going to entity-ize dynamic content out of a database that contains unknown characters, you're best off just using the code point versions of the entities, as you can easily write a method to convert any character above 127 to the entity for its code point.
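As a rough illustration of such a method, here is a minimal sketch. The class and method names (HtmlEntities, toNumericEntities) are made up for this example. It walks the string by code point rather than by char so that characters outside the BMP come out as a single entity, and it deliberately does not escape &, < or >, which you would still need to handle separately when putting plain text into HTML.

```java
// Hypothetical helper for illustration; the names are not from any library.
final class HtmlEntities {

    /** Replaces every code point above 127 with a decimal numeric entity. */
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a valid surrogate pair into one code point
            int codePoint = text.codePointAt(i);
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            // advance by 1 char for BMP characters, 2 for surrogate pairs
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u201D is the right double quotation mark from the question
        System.out.println(toNumericEntities("He said \u201Dhi\u201D"));
        // prints: He said &#8221;hi&#8221;
    }
}
```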

One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset that converts a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities. Then we just wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
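If you want the "do the encoding yourself" route without implementing a full Charset, one simpler alternative is a Writer wrapper. The sketch below is not the custom Charset described above; EntityWriter is a hypothetical class written for this answer. It assumes that an unpaired surrogate should be replaced with the entity for U+FFFD, and it remembers a high surrogate between write calls so that a pair split across two writes still becomes one entity.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical wrapper for illustration: emits ASCII only, entities for the rest.
final class EntityWriter extends FilterWriter {
    private static final int REPLACEMENT = 0xFFFD; // assumption: used for broken pairs
    private char pendingHigh; // 0 means no high surrogate is waiting for its partner

    EntityWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        char ch = (char) c;
        if (pendingHigh != 0) {
            char high = pendingHigh;
            pendingHigh = 0;
            if (Character.isLowSurrogate(ch)) {
                writeEntity(Character.toCodePoint(high, ch));
                return;
            }
            writeEntity(REPLACEMENT); // high surrogate without a low surrogate
        }
        if (Character.isHighSurrogate(ch)) {
            pendingHigh = ch; // wait for the next char before emitting anything
        } else if (Character.isLowSurrogate(ch)) {
            writeEntity(REPLACEMENT); // low surrogate with no preceding high
        } else if (ch > 127) {
            writeEntity(ch);
        } else {
            out.write(ch);
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }

    @Override
    public void close() throws IOException {
        if (pendingHigh != 0) {
            pendingHigh = 0;
            writeEntity(REPLACEMENT); // stream ended in the middle of a pair
        }
        super.close();
    }

    private void writeEntity(int codePoint) throws IOException {
        out.write("&#" + codePoint + ";");
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        try (Writer w = new EntityWriter(sink)) {
            w.write("quote: \u201D");
        }
        System.out.println(sink); // prints: quote: &#8221;
    }
}
```

In a real page you would wrap the response's writer or an OutputStreamWriter using US-ASCII instead of the StringWriter used here.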

Upvotes: 0

DNA

Reputation: 42617

Re: your final observation - Java bytes are always signed. To interpret them as unsigned, you can bitwise AND them with an int:

byte[] bytes = "”".getBytes("UTF-8");
for(byte b: bytes)
{
    System.out.println(b & 0xFF);
}

which outputs:

226 
128
157

Note that your string is actually three bytes long in UTF-8.
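The -108 from the question fits the same picture: -108 & 0xFF is 148 (hex 94), and 0x94 is the byte for ” in windows-1252, so the no-argument getBytes() was most likely using a Windows default charset. A small sketch to check this follows. SignedByteDemo and its methods are names invented for this example, and the windows-1252 charset is not one of the charsets every JVM is required to provide, hence the isSupported guard.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are for illustration only.
final class SignedByteDemo {

    /** Reinterprets a signed Java byte as a value in the range 0..255. */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    /** Decodes a single byte using the named charset. */
    static String decodeSingleByte(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte b = (byte) -108; // the value printed in the question

        System.out.println(toUnsigned(b));                      // prints 148
        System.out.println(Integer.toHexString(toUnsigned(b))); // prints 94

        // Assumption: windows-1252 is available on this JVM.
        if (Charset.isSupported("windows-1252")) {
            String decoded = decodeSingleByte(b, "windows-1252");
            // The code point should be 8221 (U+201D), the right double quote
            System.out.println((int) decoded.charAt(0));
        }
    }
}
```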

As pointed out in the comments, it depends on the encoding. For UTF-16 you get:

254
255
32
29

and for US-ASCII or ISO-8859-1 you get

63

which is a question mark (i.e. "I dunno, some new-fangled character"). Note that:

The behavior of this method [getBytes()] when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
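For what it's worth, here is a minimal sketch of what using CharsetEncoder for that extra control could look like. StrictEncoding and its method names are invented for this example. The idea is to ask up front whether a string can be encoded, or to have the encoder report an unmappable character as an exception instead of quietly substituting a question mark.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
final class StrictEncoding {

    /** Returns true if every character of text can be represented in charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Encodes text, throwing instead of substituting when a character does not fit. */
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String quote = "\u201D";

        System.out.println(canEncode(quote, StandardCharsets.US_ASCII)); // prints false
        System.out.println(canEncode(quote, StandardCharsets.UTF_8));    // prints true

        try {
            encodeStrict(quote, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            // Reached because U+201D has no US-ASCII representation
            System.out.println("cannot encode: " + e);
        }
    }
}
```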

Upvotes: 2

gkuzmin

Reputation: 2484

I think it would be better to print the character code like this:

System.out.println((int) '”'); // result is 8221

This link can help explain this unusual double quote (it includes the HTML code).

Upvotes: 2
