kentcdodds

Reputation: 29021

Why does this BufferedReader not read in the specified UTF-8 Format?

I am scraping a few websites, and some of them contain non-Latin characters and special characters such as “ and ” for quotes rather than " and ’ for apostrophes rather than '.

Here's the real curve ball...

I have the relevant text printed out to the console. Everything displays correctly when I run it in my IDE (NetBeans). But when I run it on my computer, “I Need Your Help” is printed out as: ΓÇ£I Need Your HelpΓÇ¥...

Before anyone says I need to set my JAVA_TOOL_OPTIONS Environment Variable to -Dfile.encoding=UTF8 let me say that I have already done that and this is still a problem. Besides, shouldn't my specifying the encoding for the buffered reader to be "UTF-8" override that anyway?

Here's some info:

Here's my code. Let me know whether you need more info. Thanks!

/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public synchronized static BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
  URL url = new URL(urlString);
  InputStream is = url.openStream();
  BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
  return br;
}

Upvotes: 3

Views: 8704

Answers (3)

Muhammad Usman Ghani

Reputation: 1279

// Fragment: assumes `in` is an open InputStream and `tv` is a text view
// (both come from the surrounding code, which is not shown).
// Needs java.io.* and java.nio.charset.StandardCharsets.
StringBuilder sb = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    // Read until end of stream; an IOException ends the loop instead of
    // leaving it spinning on a stale line.
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');
    }
} catch (IOException e) {
    e.printStackTrace();
}
tv.setText(sb.toString());

Upvotes: -1

parsifal

Reputation: 36

There are two possibilities here. As user1291492 said, it could be that you read the content correctly but the encoding that your terminal uses is different from the one your IDE uses.

The other possibility is that the source data is not in UTF-8. If you're scraping a website, you should pay attention to the encoding the site declares in its Content-Type header rather than assume that it's always UTF-8.
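To illustrate the second possibility, here is a minimal sketch of picking the charset from the Content-Type header instead of hard-coding it. The class and method names (CharsetAwareReader, charsetFrom, open) are made up for this example, and falling back to UTF-8 when the header gives no usable charset is an assumption, not something the server guarantees.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the asker's code.
final class CharsetAwareReader {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Opens the URL and decodes the body with the charset the server declared.
    static BufferedReader open(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        Charset cs = charsetFrom(conn.getContentType());
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), cs));
    }
}
```

Calling CharsetAwareReader.open(urlString) in place of getBufferedReader(urlString) would then decode each page with whatever the server reports. Some pages declare their encoding only in an HTML meta tag, which this sketch does not look at.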

Upvotes: 2

ControlAltDel

Reputation: 35011

Your IDE's output window can probably interpret and display UTF-8 characters. The console may not be able to.
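The garbled output in the question fits this explanation: the UTF-8 bytes of “ are E2 80 9C, and a console using code page 437 shows those three bytes as ΓÇ£. The sketch below reproduces the effect. The class name MojibakeDemo and the method misread are made up for illustration, and it assumes the JDK in use ships the IBM437 charset (full JDK installs normally do).

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's code.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another
    // charset, which is what a console does when its code page is not UTF-8.
    static String misread(String text, Charset consoleCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), consoleCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String title = "\u201CI Need Your Help\u201D"; // curly quotes
        // Assumption: IBM437 (code page 437) is available in this JDK.
        Charset cp437 = Charset.forName("IBM437");
        // Write UTF-8 bytes explicitly so the output does not depend on
        // the JVM's default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Correctly decoded: " + title);
        out.println("Read as CP437:     " + misread(title, cp437));
    }
}
```

If this is the cause, the reader in the question is already decoding correctly and the fix is on the output side: the console has to interpret the bytes as UTF-8 (on a Windows command prompt, `chcp 65001` switches the code page) and use a font that contains the characters.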

Upvotes: 1
