alwaysLearning

Reputation: 99

Character encoding

I receive an HTML file that I need to read and parse. The file can be in plain English, Japanese, or any other language, with the character encoding that language requires. The problem occurs when the file is in Japanese with any of these encodings

I tried reading the file with FileReader, but the result is all garbage characters. I also tried FileInputStream with a hard-coded Japanese encoding, to check whether a Japanese file is read correctly, but the result is not as expected.

FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, " ISO-2022-JP");

I don't have much experience with character encoding and internationalization. Any suggestions on how I can read and write files with different encodings?

One more thing: I don't know how to get the character encoding of the HTML file I am reading. I understand that I need to write the file in the same encoding, but I'm not sure how to get the original file's encoding. Thanks.

Upvotes: 2

Views: 1471

Answers (1)

Michael Borgwardt

Reputation: 346476

  • Forget that FileReader exists. It implicitly uses the platform default encoding, which makes it pretty much useless.
  • Your code with the hardcoded encoding is correct except for the encoding name itself, which has a leading space. If you remove it, the code should correctly read ISO-2022-JP encoded files.
  • As for getting the character encoding of the HTML file, there are a number of ways it can be transmitted:
    • on the HTTP level, in a Content-Type HTTP header, but this is only available when you read the file from the web server, not when it has been saved as a file
    • as a corresponding META HTML tag: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
    • or, if the document type is XHTML, in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
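
For a file saved on disk, the last two options can be combined into a small sniffing step. What follows is only a minimal sketch, not a complete detector: `CharsetSniffer`, `sniff`, and `read` are hypothetical names made up for illustration, and the regular expression is a rough approximation of the META tag and XML declaration shown above. It assumes the declaration sits within the first 4096 bytes and that the real encoding leaves ASCII bytes unchanged (true for ISO-2022-JP, EUC-JP, Shift_JIS, and UTF-8, but not for UTF-16), so that the header can be decoded as ISO-8859-1 before the real charset is known.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class CharsetSniffer {
    // Rough match for charset=... in a META tag or encoding="..." in an XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset found in the head of the file, or the fallback.
    static Charset sniff(byte[] bytes, Charset fallback) {
        int len = Math.min(bytes.length, 4096);
        // ISO-8859-1 maps every byte to a char, so this decoding never fails.
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback instead.
            }
        }
        return fallback;
    }

    // Reads the whole file and decodes it with the sniffed (or fallback) charset.
    static String read(Path file, Charset fallback) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, sniff(bytes, fallback));
    }
}
```

To write the file back in the same encoding, one option is to keep the `Charset` returned by `sniff` and pass it to `text.getBytes(charset)` or to an `OutputStreamWriter`. If the file declares no encoding at all, the fallback is only a guess, and this sketch does no statistical detection.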

Upvotes: 4
