Reputation: 9154
I am reading an HTTP response from a Perl page in a Servlet like this:
public String getHTML(String urlToRead) {
    URL url;
    HttpURLConnection conn;
    BufferedReader rd;
    String line;
    String result = "";
    try {
        url = new URL(urlToRead);
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        while ((line = rd.readLine()) != null) {
            byte[] b = line.getBytes();
            result += new String(b, "UTF-8");
        }
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
I am displaying this result with this code:
response.setContentType("text/plain; charset=UTF-8");
PrintWriter out = new PrintWriter(new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
try {
    String query = request.getParameter("query");
    String type = request.getParameter("type");
    String res = getHTML(url);
    out.write(res);
} finally {
    out.close();
}
But the response is still not encoded as UTF-8. What am I doing wrong?
Thanks in advance.
Upvotes: 2
Views: 9204
Reputation: 4599
In my case, I had to add another configuration.
Previously, I was writing the page this way:
try (PrintStream printStream = new PrintStream(response.getOutputStream())) {
    printStream.print(pageInjecting);
}
I changed to:
try (PrintStream printStream = new PrintStream(response.getOutputStream(), false, "UTF-8")) {
    printStream.print(pageInjecting);
}
Upvotes: 0
Reputation: 1
I also faced the same problem in another scenario, and I believe this will work: use
    byte[] b = line.getBytes(UTF8_CHARSET);
instead of the current call in the while loop:
while ((line = rd.readLine()) != null) {
    byte[] b = line.getBytes(); // NOT UTF-8
    result += new String(b, "UTF-8");
}
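For reference, UTF8_CHARSET above is assumed to be a Charset constant such as java.nio.charset.StandardCharsets.UTF_8; a minimal sketch of the loop with an explicit charset on both conversions:
while ((line = rd.readLine()) != null) {
    // Encode and decode with an explicit charset instead of the platform default,
    // so the byte[] round-trip no longer depends on the JVM's default encoding.
    byte[] b = line.getBytes(StandardCharsets.UTF_8);
    result += new String(b, StandardCharsets.UTF_8);
}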
Upvotes: 0
Reputation: 28648
That call to line.getBytes() looks suspicious. You should probably make it line.getBytes("UTF-8") if you are certain that what is returned is UTF-8 encoded. Additionally, I'm not sure why it is even necessary. A typical approach to getting data out of a BufferedReader is to use a StringBuilder and keep appending each String retrieved from readLine to the result. The conversion back and forth between String and byte[] is unnecessary.
Change result into a StringBuilder and do this:
while ((line = rd.readLine()) != null) {
    result.append(line);
}
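Putting it together, a minimal sketch of getHTML rewritten along these lines, keeping the question's connection setup (a try-with-resources is added for the reader):
public String getHTML(String urlToRead) {
    StringBuilder result = new StringBuilder();
    try {
        URL url = new URL(urlToRead);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        // Decode the response bytes as UTF-8 once, at the reader.
        try (BufferedReader rd = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = rd.readLine()) != null) {
                // No byte[] round-trip: append the already-decoded line.
                // Note that readLine() drops the line terminators.
                result.append(line);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result.toString();
}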
Upvotes: 3
Reputation: 12817
Here is where you break the chain of character encoding conversions:
while ((line = rd.readLine()) != null) {
    byte[] b = line.getBytes(); // NOT UTF-8
    result += new String(b, "UTF-8");
}
From String#getBytes() javadoc:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array
And the default charset is probably not UTF-8.
But why do all the conversions in the first place? Just read the raw bytes from the source and write the raw bytes to the consumer. It's supposed to be UTF-8 all the way.
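A minimal sketch of that byte-for-byte relay, assuming the servlet only needs to forward the upstream body unchanged (urlToRead and response are named as in the question):
HttpURLConnection conn = (HttpURLConnection) new URL(urlToRead).openConnection();
conn.setRequestMethod("GET");
response.setContentType("text/plain; charset=UTF-8");
// Copy the body byte-for-byte; nothing is decoded or re-encoded,
// so whatever UTF-8 the Perl page sent reaches the client untouched.
try (InputStream in = conn.getInputStream();
     OutputStream out = response.getOutputStream()) {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}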
Upvotes: 2