Reputation: 1723
I'm trying to fetch data from latata.pl/pl.php and display all the characters correctly (Polish, ISO-8859-2).
final URL url = new URL("http://latata.pl/pl.php");
final URLConnection urlConnection = url.openConnection();
final BufferedReader in = new BufferedReader(new InputStreamReader(
urlConnection.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
}
in.close();
It doesn't work. :( Any ideas?
Upvotes: 2
Views: 756
Reputation: 241
As someone has already stated, no charset is specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in Central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.
The response needs to include the header Content-Type: text/html; charset=ISO-8859-2 for the bytes to be interpreted correctly. A client can then use this charset when decoding the bytes of the response InputStream.
Upvotes: 2
Reputation: 5499
The output of your PHP script pl.php is faulty. It sends the HTTP header Content-Type: text/html without a declared charset. Without a declared charset, the client has to assume ISO-8859-1, according to the HTTP/1.1 specification (RFC 2616). The body that was sent reads ±ê³ó¿¡Ê£¯¬ when interpreted as ISO-8859-1.
The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as
Content-Type: text/html; charset=ISO-8859-2
You can check this with a simple code fragment that takes the text wrongly decoded as ISO-8859-1 and decodes the same bytes as ISO-8859-2 instead:
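// The body as it looks when wrongly decoded as ISO-8859-1
final String test = "±ê³ó¿¡Ê£¯¬";
// Recover the original bytes, then decode them as ISO-8859-2
String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);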
The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.
As a quick fix, have your PHP script send Content-Type: text/html; charset=ISO-8859-2 as the HTTP header.
But you should consider switching to UTF-8 encoded output anyway.
Upvotes: 2
Reputation: 28293
This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.
Here's what you get back:
$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl
HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html
����ʣ��Connection closed by foreign host.
The HTML is simply:
<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>
And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?
Upvotes: 2
Reputation: 18998
Your InputStreamReader will attempt to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.
Assuming the web server is doing a good job, you can find the correct encoding in the Content-Type HTTP header, as its charset parameter. Or you can just assume it's ISO-8859-2, but that might break later.
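A minimal sketch of the header approach, assuming the charset arrives as a parameter of the Content-Type value (which URLConnection exposes through getContentType()). The helper name charsetOf and the choice of ISO-8859-2 as the fallback are illustrative assumptions, not part of any library:

```java
import java.nio.charset.Charset;

public class CharsetFromHeader {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=ISO-8859-2". Returns the fallback when
    // the header is missing, has no charset, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Assumed fallback for this particular page only
        Charset fallback = Charset.forName("ISO-8859-2");
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

With the question's code, the result of charsetOf(urlConnection.getContentType(), fallback) could be passed as the second argument to InputStreamReader. For the page as it is served now, the header carries no charset, so the fallback would be used.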
Upvotes: 3
Reputation: 1656
InputStreamReader has several constructors; in a case like this you can (and should) pass the encoding to one of them.
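As a rough sketch, the two-argument constructor InputStreamReader(InputStream, String) could be used like this. The class and method names are made up for illustration, and a ByteArrayInputStream stands in for urlConnection.getInputStream() so the example runs without a network connection:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ExplicitCharsetRead {
    // Reads every line of the stream, decoding with the named charset
    // rather than the platform default.
    static String readAll(InputStream in, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charsetName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response body: the letters "ąęłóż" encoded
        // as ISO-8859-2 bytes.
        byte[] body = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF};
        System.out.print(readAll(new ByteArrayInputStream(body), "ISO-8859-2"));
    }
}
```

Whether the letters then show up correctly on the console still depends on the console's own encoding, which is a separate matter from decoding the response.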
Upvotes: 3