user425727

Reading a file with accented characters in Java

I came across two special characters that do not seem to be covered by the ISO-8859-1 character set, i.e. they don't make it through to my program:

The German ß and the Norwegian ø

I'm reading the files as follows:

FileInputStream inputFile = new FileInputStream(corpus[i]);
InputStreamReader ir = new InputStreamReader(inputFile, "ISO-8859-1");

Is there a way for me to read these characters without having to apply manual replacement as a workaround?

[EDIT]

This is how it looks on screen. Note that I have no problems with other accented characters, e.g. è and the like.

[screenshot]

Upvotes: 4

Views: 14781

Answers (3)

Both characters are present in ISO-Latin-1 (check my name to see why I've looked into this).

If the characters are not read in correctly, the most likely cause is that the text in the file was not saved in that encoding, but in something else.

Depending on your operating system and the origin of the file, possible encodings include UTF-8 or an OEM (DOS/console) code page such as 850 or 437.

The easiest way to find out is to look at the file with a hex editor and report back the exact byte values stored for these two characters.
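
To know what to look for in the hex editor, it helps to see which bytes each candidate encoding would produce. The following is a minimal sketch, not part of the original answer; the class name EncodingProbe and the helper hex are made up for illustration. It only uses the two charsets every JVM is guaranteed to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingProbe {
    // Returns the bytes of `text` in `charset` as space-separated uppercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for ß and ø, so the encoding of this source file does not matter.
        String[] samples = { "\u00DF", "\u00F8" };
        Charset[] charsets = { StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8 };
        for (String sample : samples) {
            for (Charset cs : charsets) {
                System.out.println(sample + " in " + cs.name() + ": " + hex(sample, cs));
            }
        }
    }
}
```

In ISO-8859-1, ß is the single byte DF and ø is F8. In UTF-8 they are the two-byte sequences C3 9F and C3 B8. If the hex editor shows anything else at those positions, the file is in some other encoding.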

Upvotes: 3

Matt Ball

Reputation: 359776

ISO-8859-1 covers ß and ø, so the file is probably saved in a different encoding. You should pass the file's actual encoding to new InputStreamReader().
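
As a rough illustration of why the charset argument matters, the sketch below decodes the same bytes twice. It assumes, purely for the example, that the file content is "Straße" saved as UTF-8, and uses an in-memory byte array as a stand-in for the file. CharsetReadDemo and firstLine are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadDemo {
    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file on disk: "Straße" saved as UTF-8 (an assumption for this demo).
        byte[] fileBytes = "Stra\u00DFe".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        String wrong = firstLine(new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("Stra\u00DFe")); // true
        System.out.println(wrong.equals("Stra\u00DFe")); // false
        System.out.println(wrong.length());              // 7: the two UTF-8 bytes of ß became two chars
    }
}
```

With a real file, the ByteArrayInputStream would be replaced by a FileInputStream. Passing a Charset object instead of a name string also avoids the checked UnsupportedEncodingException.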

Upvotes: 1

WhiteFang34

Reputation: 72039

Assuming your file is UTF-8 encoded, try this:

InputStreamReader ir = new InputStreamReader(inputFile, "UTF-8");
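
Since UTF-8 is only an assumption here, it may be worth checking it before relying on it. InputStreamReader silently substitutes a replacement character for bytes it cannot decode, so a wrong guess does not raise an error. The following sketch (Utf8Check and isValidUtf8 are hypothetical names, not from the answer) uses a strict CharsetDecoder that reports malformed input instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same two characters (ß and ø) encoded two different ways.
        byte[] utf8 = "\u00DF\u00F8".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "\u00DF\u00F8".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: DF must be followed by a continuation byte
    }
}
```

For a real file, the byte array could come from java.nio.file.Files.readAllBytes. A passing check does not prove the file is UTF-8 (pure ASCII text passes too), but a failing one rules it out.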

Upvotes: 0
