Reputation: 17923
We have a CLOB column in a database. When we extract the CLOB and try to display it (as plain text, not HTML), some characters show up as junk on the HTML page. When streamed directly to a file, the character looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("”".getBytes()[0]);
prints -108.
Why would a character's byte be in the negative range? Is there any way to display it correctly on an HTML page?
Upvotes: 2
Views: 879
Reputation: 18397
To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ASCII characters.
To set the encoding you have two options.
1. Set the HTTP header Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that option 1 takes precedence over option 2, i.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally if you are going to entity-ize dynamic content out of a database that contains unknown characters you're best off just using the code point versions of the entities as you can easily write a method to convert any character >127 to its appropriate code point.
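A minimal sketch of such a conversion method follows. The class and method names are hypothetical, and it only handles the "anything above 127" rule described above. It does not escape HTML-special ASCII characters such as & or <, which plain text shown on an HTML page would also need.

```java
// Hypothetical helper; names are illustrative, not from any library.
final class HtmlEntityEscaper {
    private HtmlEntityEscaper() {}

    /** Replaces every code point above 127 with a decimal numeric character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Walk by code point rather than by char, so a surrogate pair
        // becomes a single entity instead of two broken ones.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u201D is the right double quotation mark from the question.
        System.out.println(escapeNonAscii("quote: \u201D"));
    }
}
```

The \u201D escape is used in the source so the example does not depend on the encoding of the .java file itself.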
One of the systems I currently work on ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset that converts a stream of Java characters into an ASCII-encoded byte stream, with all non-ASCII characters converted to entities. Then we wrapped the output stream in a Writer using that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
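The custom Charset itself is not shown here. As a simpler stand-in for "doing the encoding yourself", this sketch wraps any Writer and emits entities for non-ASCII characters. The class name and the escapeChunks helper are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not something the answer specifies.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical class; a Writer-level stand-in for the custom Charset described above.
final class EntityEscapingWriter extends FilterWriter {
    private static final int REPLACEMENT = 0xFFFD; // used for unpaired surrogates

    // A high surrogate is held back until the next char shows whether it is paired.
    private char pendingHigh;
    private boolean hasPendingHigh;

    EntityEscapingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        char ch = (char) c;
        if (hasPendingHigh) {
            hasPendingHigh = false;
            if (Character.isLowSurrogate(ch)) {
                writeEntity(Character.toCodePoint(pendingHigh, ch));
                return;
            }
            writeEntity(REPLACEMENT); // the held-back high surrogate had no partner
        }
        if (Character.isHighSurrogate(ch)) {
            pendingHigh = ch;
            hasPendingHigh = true;
        } else if (Character.isLowSurrogate(ch)) {
            writeEntity(REPLACEMENT); // low surrogate without a preceding high one
        } else if (ch > 127) {
            writeEntity(ch);
        } else {
            out.write(ch);
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(s.charAt(i));
        }
    }

    @Override
    public void close() throws IOException {
        if (hasPendingHigh) {
            hasPendingHigh = false;
            writeEntity(REPLACEMENT);
        }
        super.close();
    }

    private void writeEntity(int codePoint) throws IOException {
        out.write("&#" + codePoint + ";");
    }

    /** Demo helper: feeds the chunks through the writer one by one. */
    static String escapeChunks(String... chunks) {
        StringWriter sink = new StringWriter();
        try (EntityEscapingWriter writer = new EntityEscapingWriter(sink)) {
            for (String chunk : chunks) {
                writer.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }
}
```

The surrogate gotcha shows up when a pair is split across two write calls: escapeChunks("\uD83D", "\uDE00") should still produce one entity, which is why the high surrogate is buffered instead of being written immediately.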
Upvotes: 0
Reputation: 42617
Re: your final observation - Java bytes are always signed (range -128 to 127), so any byte value above 127 prints as a negative number. To interpret them as unsigned, you can bitwise AND them with the int 0xFF:
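import java.nio.charset.StandardCharsets;

// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of getBytes("UTF-8")
byte[] bytes = "”".getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
    // b is promoted to int; the mask keeps only the low 8 bits
    System.out.println(b & 0xFF);
}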
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
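The -108 from the question is consistent with a single-byte encoding such as windows-1252, where this character is the byte 0x94 (148 unsigned, -108 signed). That the asker's platform default charset was windows-1252 is an assumption here, since the question does not say. A small sketch to check it, with hypothetical class and method names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the signed and unsigned views of a byte.
final class SignedByteDemo {
    private SignedByteDemo() {}

    /** First encoded byte as Java sees it: a signed value in -128..127. */
    static int firstByteSigned(String text, Charset charset) {
        return text.getBytes(charset)[0];
    }

    /** The same byte reinterpreted as unsigned, in 0..255. */
    static int firstByteUnsigned(String text, Charset charset) {
        return Byte.toUnsignedInt(text.getBytes(charset)[0]);
    }

    public static void main(String[] args) {
        String quote = "\u201D"; // right double quotation mark
        // Assumption: windows-1252 stands in for the asker's default charset.
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(firstByteSigned(quote, cp1252));
        System.out.println(firstByteUnsigned(quote, cp1252));
        System.out.println(firstByteUnsigned(quote, StandardCharsets.UTF_8));
    }
}
```

Byte.toUnsignedInt (Java 8 and later) does the same thing as b & 0xFF.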
As pointed out in the comments, it depends on the encoding. For UTF-16 you get:
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
"The behavior of this method [getBytes()] when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required."
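As a rough illustration of that advice, a CharsetEncoder can be told to report unmappable characters instead of silently substituting a question mark. The class and method names below are hypothetical. The encoder calls are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper showing strict encoding with CharsetEncoder.
final class StrictEncoding {
    private StrictEncoding() {}

    /** Returns the encoded bytes, or empty if the text cannot be represented in the charset. */
    static Optional<byte[]> tryEncode(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer encoded = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[encoded.remaining()];
            encoded.get(bytes);
            return Optional.of(bytes);
        } catch (CharacterCodingException e) {
            // Thrown for unmappable characters or malformed input (e.g. a lone surrogate).
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String quote = "\u201D";
        System.out.println(tryEncode(quote, StandardCharsets.US_ASCII).isPresent());
        System.out.println(tryEncode(quote, StandardCharsets.UTF_8).isPresent());
    }
}
```

If only a yes/no answer is needed, charset.newEncoder().canEncode(text) is a shorter check.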
Upvotes: 2