latata

Reputation: 1723

How to get non-latin characters from website?

I am trying to fetch data from latata.pl/pl.php and display all of its characters (Polish, ISO-8859-2).

    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

It doesn't work. :( Any ideas?

Upvotes: 2

Views: 756

Answers (5)

Will

Reputation: 241

As someone has already stated, no charset is specified for the response. Forcing the response document to be read as ISO-8859-2 (typically used in Central Europe) displays legitimate Polish characters, so I assume this is the encoding actually in use. Since no encoding has been specified, ISO-8859-1 is assumed, as this is the HTTP default.

The response needs to include the header Content-Type: text/html; charset=ISO-8859-2 for the bytes to be interpreted correctly. A client can then use this charset when decoding the response InputStream.

Upvotes: 2

Michael Konietzka

Reputation: 5499

The output of your PHP script pl.php is faulty. It sends the HTTP header Content-Type: text/html without a declared charset. Without a declared charset, the client has to assume ISO-8859-1, according to the HTTP specification. The body sent is ±ê³ó¿¡Ê£¯¬ if interpreted as ISO-8859-1.

The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as

Content-Type: text/html; charset=ISO-8859-2

You can check this with a simple code fragment, which converts the string misread as ISO-8859-1 back to bytes and decodes those bytes as ISO-8859-2:

final String test = "±ê³ó¿¡Ê£¯¬";
// Recover the original bytes, then decode them as ISO-8859-2.
final String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);

The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.

As a quick fix, change your PHP script so that it sends Content-Type: text/html; charset=ISO-8859-2 as the HTTP header.

You should consider switching to UTF-8 encoded output anyway.

Upvotes: 2

SyntaxT3rr0r

Reputation: 28293

This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.

Here's what you get back:

$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl

HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html

����ʣ��Connection closed by foreign host.

The HTML is simply:

<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>

And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?

Upvotes: 2

dty

Reputation: 18998

Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.

Assuming the web server is doing a good job, you can find the correct encoding in the charset parameter of the Content-Type HTTP header. Or you can just assume it's ISO-8859-2, but that might break later.
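
A minimal sketch of reading the charset from a Content-Type value, with a fallback when none is declared. The class name ContentTypeCharset and the helper charsetOf are hypothetical, not part of any library; in the question's code the header value would come from urlConnection.getContentType(). ISO-8859-1 is used as the fallback only because the answers here name it as the HTTP default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=ISO-8859-2", or the fallback if the value is
    // null, has no charset parameter, or names a charset this JVM lacks.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=ISO-8859-2", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

For the page in the question the header carries no charset, so this returns the fallback and the text would still come out wrong until the server declares ISO-8859-2.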

Upvotes: 3

v6ak

Reputation: 1656

InputStreamReader has multiple constructors, and in a case like this you can (and should) specify the encoding through one of them.
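
As a rough illustration, the sketch below uses the InputStreamReader(InputStream, Charset) constructor. To stay runnable without network access it reads from a ByteArrayInputStream standing in for urlConnection.getInputStream(); the ten bytes are an assumption, reconstructed from the ISO-8859-1 rendering of the page body quoted in this thread. The class name ExplicitCharsetRead and the helper readFirstLine are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ExplicitCharsetRead {

    // Hypothetical helper: reads the first line of the stream, decoding the
    // bytes with the given charset instead of the platform default.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed body bytes of pl.php (stand-in for the real response stream).
        byte[] body = {
            (byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF,
            (byte) 0xA1, (byte) 0xCA, (byte) 0xA3, (byte) 0xAF, (byte) 0xAC
        };
        Charset latin2 = Charset.forName("ISO-8859-2");
        System.out.println(readFirstLine(new ByteArrayInputStream(body), latin2));
    }
}
```

In the question's code, the equivalent change would be passing the charset as the second argument to the existing new InputStreamReader(urlConnection.getInputStream()) call. Whether the characters then print correctly also depends on the console's own encoding.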

Upvotes: 3
