Reputation: 44496
I'd like to download the sources of many webpages, write each one to a file, and print it in the NetBeans console. I have a problem with encoding. First, check out my code:
public static final void foo(URL url, Charset encoding, String file) {
    BufferedReader in;
    String readLine;
    try
    {
        in = new BufferedReader(new InputStreamReader(url.openStream(), encoding));
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding));
        while ((readLine = in.readLine()) != null) {
            System.out.println(readLine+"\n");
            out.write(readLine+"\n");
        }
        out.flush();
        out.close();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}
I am testing this on two foreign websites (e.g. Czech and Thai).
I tried Charset.forName("UTF-8"), which seems to work correctly for the Thai webpage but not for the Czech one: both the console and the file contain replacement characters such as �.
I have also tried ISO-8859-2, which saves the file correctly, but the console shows small rectangles instead of letters such as ž and š.
Is there a universal solution for multilingual websites (Czech, Japanese, Thai and more) that lets me save the page to a file correctly as well as print it to the console or store it in a variable?
Upvotes: 1
Views: 1359
Reputation: 102
The problem is that there is no such thing as a universal encoding. The de facto standard today is probably UTF-8, but each site can decide for itself which encoding it uses. Here is a pretty decent article worth reading that describes the basic problem of character encoding as a worldwide solution.
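To make the mismatch concrete, here is a minimal sketch of what happens when bytes written in one encoding are decoded with another. The class name DecodeDemo and the helper decode are made up for illustration, and the sketch assumes the JVM ships the ISO-8859-2 charset (it is not one of the charsets every JVM is required to support, although common JDKs include it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: interpret raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-2 is available on this JVM.
        Charset latin2 = Charset.forName("ISO-8859-2");

        // The Czech letter z-with-caron (U+017E) is the single byte 0xBE in ISO-8859-2.
        byte[] czechBytes = "\u017E".getBytes(latin2);

        String asLatin2 = decode(czechBytes, latin2);
        String asUtf8 = decode(czechBytes, StandardCharsets.UTF_8);

        // Print code points in hex so the result does not depend on the console encoding.
        System.out.println(Integer.toHexString(asLatin2.codePointAt(0))); // prints 17e
        System.out.println(Integer.toHexString(asUtf8.codePointAt(0)));   // prints fffd
    }
}
```

A lone 0xBE byte is not valid UTF-8, so the decoder substitutes the replacement character U+FFFD, which is the � seen in the question.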
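Therefore, the best solution would be to get the HTML page's encoding with InputStreamReader.getEncoding(). Keep in mind that getEncoding() reports the charset the reader itself is decoding with, so for a reader created without an explicit charset it returns the platform default rather than anything the page declares: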
public static final void foo(URL url, String file){
    BufferedReader in;
    String readLine;
    try{
        InputStreamReader isr = new InputStreamReader(url.openStream());
        String encoding = isr.getEncoding(); // charset this reader decodes with, if you actually need it, which I don't suppose
        in = new BufferedReader(isr);
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding));
        while ((readLine = in.readLine()) != null) {
            System.out.println(readLine+"\n");
            out.write(readLine+"\n");
        }
        out.flush();
        out.close();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}
This should work as intended as long as the page's actual encoding matches the charset the reader is using.
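If the server announces the page's encoding, one possible alternative is to read it from the Content-Type response header and decode with that charset. The following is only a sketch under assumptions: the class PageSaver and the helpers charsetFromContentType and save are hypothetical names, the fallback to UTF-8 is an arbitrary choice, and pages that declare their charset only in an HTML meta tag are not handled.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PageSaver {
    // Hypothetical helper: extract "charset=..." from a Content-Type value.
    // Returns the fallback when the header is missing, has no charset
    // parameter, or names a charset this JVM does not know.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decode the page with its announced charset and always write the file as UTF-8.
    static void save(URL url, String file) throws IOException {
        URLConnection conn = url.openConnection();
        Charset pageCharset = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), pageCharset));
             BufferedWriter out = Files.newBufferedWriter(Paths.get(file), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Once the text is decoded correctly, it is ordinary Unicode inside a Java String, so the output file can use a single encoding such as UTF-8 for every language. What the console displays is a separate matter that depends on the console's own encoding and font.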
Upvotes: 2