
Reputation: 29091

Why does this BufferedReader not read in the specified UTF-8 Format?

I am scraping a few websites, and some of them contain non-Latin characters and special characters like “ for quotes rather than " and ’ for apostrophes rather than '.

Here's the real curve ball...

I have the relevant text printed out to the console. Everything encodes fine when I run it in my IDE (Netbeans). But when I run it on my computer “I Need Your Help” is printed out as: ΓÇ£I Need Your HelpΓÇ¥...

Before anyone says I need to set my JAVA_TOOL_OPTIONS Environment Variable to -Dfile.encoding=UTF8 let me say that I have already done that and this is still a problem. Besides, shouldn't my specifying the encoding for the buffered reader to be "UTF-8" override that anyway?

Here's some info:

I'm using JDK 7 with the target platform set to 1.7.
All the machines I'm running this on are Windows 7, and they all show the same problem (some don't have JAVA_TOOL_OPTIONS set, but that doesn't seem to make any difference).
I think the default encoding that it's using is Cp1252...

Here's my code. Let me know whether you need more info. Thanks!

/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public synchronized static BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
  URL url = new URL(urlString);
  InputStream is = url.openStream();
  BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
  return br;
}

Upvotes: 3

Answers (3)

Muhammad Usman Ghani

Reputation: 1279

// Assumes `in` is an already opened InputStream and `tv` is a text view defined elsewhere.
StringBuilder sb = new StringBuilder();
try {
    // Decode the stream explicitly as UTF-8 instead of using the platform default
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');
    }
} catch (IOException e) {
    // Also covers UnsupportedEncodingException, which is a subclass of IOException
    e.printStackTrace();
}
tv.setText(sb.toString());

Upvotes: -1

parsifal

Reputation: 36

There are two possibilities here. As user1291492 said, it could be that you read the content correctly but the encoding that your terminal uses is different from the one your IDE uses.

The other possibility is that the source data is not in UTF-8. If you're scraping a website, you should use the encoding that the site declares in its Content-Type header rather than assume that it's always UTF-8.
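
A minimal sketch of that second point, assuming the server sends a charset parameter in its Content-Type header. The class and method names (CharsetAwareReader, charsetFromContentType, open) are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption, not something the site guarantees:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetAwareReader {

    /** Extracts the charset parameter from a Content-Type value; falls back to UTF-8 (an assumption). */
    public static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    // Strip optional quotes around the charset name
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: fall through to the default
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Opens the URL and wraps its stream in a reader that uses the declared charset. */
    public static BufferedReader open(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        Charset cs = charsetFromContentType(conn.getContentType());
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), cs));
    }
}
```

Some pages declare their encoding only in an HTML meta tag rather than in the header; this sketch does not handle that case.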

Upvotes: 2

ControlAltDel

Reputation: 35106

The IDE's output window probably understands and prints UTF-8 characters. The console may not be so advanced.
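
To check whether the console is the culprit, here is a small sketch that reproduces the garbling from the question. It assumes the console uses the IBM437 (CP437) code page, which you can confirm with chcp; the class and method names are made up for illustration. The UTF-8 bytes of “ are E2 80 9C, and in CP437 those three bytes display as ΓÇ£, which matches the output in the question. That suggests the text was read correctly and only the display step is wrong.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {

    /** Encodes text as UTF-8, then decodes those bytes with another charset, as a mismatched console would. */
    public static String misread(String text, String consoleCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(consoleCharset));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes keep the example independent of the source file's encoding
        String text = "\u201CI Need Your Help\u201D";

        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // IBM437 is an assumed console code page; it may be absent from some JVMs
        if (Charset.isSupported("IBM437")) {
            System.out.println("UTF-8 bytes shown as IBM437: " + misread(text, "IBM437"));
        }

        // Write UTF-8 bytes explicitly; the console must also be switched to UTF-8
        // (for example with "chcp 65001") for these to display correctly
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(text);
    }
}
```

Note that CP437 has no curly quotes at all, so writing in the console's own code page would only turn them into question marks; the console itself has to be set to a code page that contains them.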

Upvotes: 1
