Nikolas

Reputation: 44496

Java getting url with the correct encoding

I'd like to download the source of many web pages, write each one to a file, and print it in the NetBeans console. I have a problem with encoding. Here is my code:

import java.io.*;
import java.net.URL;
import java.nio.charset.Charset;

public static final void foo(URL url, Charset encoding, String file) {
    // try-with-resources closes both streams even if reading fails
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), encoding));
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding))) {
        String readLine;
        while ((readLine = in.readLine()) != null) {
            System.out.println(readLine + "\n");
            out.write(readLine + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I am testing this on two foreign websites (one Czech, one Thai).

I tried Charset.forName("UTF-8"), which seems to work correctly for the Thai page but not for the Czech one: both the console and the file contain the replacement character �.

I have also tried ISO-8859-2, which saves the file correctly, but the console shows small rectangles instead of letters such as ž and š.

Is there a universal solution for multilingual websites (Czech, Japanese, Thai and more) that lets me save the page to a file correctly, as well as print it to the console or store it in a variable?

Upvotes: 1

Views: 1359

Answers (1)

Nik-Sch

Reputation: 102

The problem is that there is no such thing as the ultimate encoding. The state-of-the-art encoding is probably UTF-8 at the moment, but each site can decide on its own which encoding it uses. Here is a pretty decent article worth reading that describes the basic problem of character encoding as a worldwide solution.
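To make the mismatch concrete, here is a minimal sketch of what happens when bytes written in one encoding are decoded with another. It assumes the JDK ships the ISO-8859-2 charset (it is not one of the charsets every JVM is required to support, though common JDKs include it). The class name EncodingMismatchDemo and the helper decode are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Decodes raw bytes with the given charset. The String constructor
    // replaces malformed input with the replacement character U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-2 is available in this JDK.
        Charset latin2 = Charset.forName("ISO-8859-2");

        // "žš" encoded the way a Latin-2 server would send it.
        byte[] czechBytes = "\u017e\u0161".getBytes(latin2);

        String decodedAsUtf8 = decode(czechBytes, StandardCharsets.UTF_8);
        String decodedAsLatin2 = decode(czechBytes, latin2);

        // The UTF-8 decode contains U+FFFD; the Latin-2 decode round-trips.
        System.out.println(decodedAsUtf8.contains("\uFFFD"));
        System.out.println(decodedAsLatin2.equals("\u017e\u0161"));
    }
}
```

The same bytes are only readable when the reader's charset matches the one the server used, which is why a single hard-coded charset cannot fit every site.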

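Therefore, the best solution would be to ask the reader which encoding it is using with InputStreamReader.getEncoding(). Keep in mind that a reader created without an explicit charset decodes with the platform default, so that is what getEncoding() reports here: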

import java.io.*;
import java.net.URL;

public static final void foo(URL url, String file) {
  try (InputStreamReader isr = new InputStreamReader(url.openStream());
       BufferedReader in = new BufferedReader(isr)) {
    // getEncoding() names the charset this reader decodes with;
    // with no charset passed in, that is the platform default
    String encoding = isr.getEncoding(); // if you actually need it, which I don't suppose
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding))) {
      String readLine;
      while ((readLine = in.readLine()) != null) {
        System.out.println(readLine + "\n");
        out.write(readLine + "\n");
      }
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
}

This should work as intended.
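If the platform default does not match a given site, one alternative (not part of the code above) is to use the charset the server declares in its Content-Type header. The following is only a sketch under assumptions: the class CharsetFromHeader and the method charsetOf are hypothetical names, the header value is passed in as a plain string so the example runs without a network, and pages that declare their charset only in an HTML meta tag are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromHeader {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-2". Returns the fallback when the
    // header is missing, has no charset parameter, or names a charset
    // this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

In real use the header string would come from url.openConnection().getContentType(), and the resulting Charset could be passed to both the InputStreamReader and the OutputStreamWriter so the file is written in the same encoding it was read in.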

Upvotes: 2
