Reputation: 139
When I try to read thesaurus.txt, the first thing the scanner returns is "ÿþ ", although the first entry in the file is "<pat>a cappella". What could be causing this?
File file = new File("thesaurus.txt");
Scanner scan;
try {
    scan = new Scanner(file);
} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
    scan = null;
}
String entry;
ArrayList<String> thes = new ArrayList<String>();
// hasNextLine() is the check that pairs with nextLine()
while (scan.hasNextLine())
{
    entry = scan.nextLine();
    // compare string contents, not references
    if (!entry.isEmpty())
    {
        thes.add(entry);
    }
}
return thes;
Upvotes: 0
Views: 215
Reputation: 206896
Your input file is probably a UTF-16 (LE) file that starts with a byte order mark (BOM).
If you look at this file as if it were ISO 8859-1, you'll see those two characters, ÿþ, which have codes FF and FE in that character encoding. Those are exactly the bytes you would expect at the start of a file with a UTF-16 LE BOM.
You should explicitly specify the character encoding when reading the file, instead of relying on the default character encoding of your system:
scan = new Scanner(file, "UTF-16");
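To see the effect in isolation, here is a minimal sketch. It assumes the file really is UTF-16 LE with a BOM, which I can't verify from the question. The class name `ThesaurusReader` and the helpers `readEntries` and `writeSample` are made up for this illustration. It writes a small sample file with the bytes FF FE in front, then reads it back once as "UTF-16" and once as "ISO-8859-1":

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ThesaurusReader {

    // Hypothetical helper: reads all non-empty lines, decoding with the given charset name.
    public static List<String> readEntries(File file, String charsetName)
            throws FileNotFoundException {
        List<String> entries = new ArrayList<>();
        try (Scanner scan = new Scanner(file, charsetName)) {
            while (scan.hasNextLine()) {
                String entry = scan.nextLine();
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    // Hypothetical helper: writes a temp file containing a UTF-16 LE BOM (FF FE)
    // followed by one entry encoded as UTF-16 LE.
    public static File writeSample() throws IOException {
        File file = File.createTempFile("thesaurus", ".txt");
        file.deleteOnExit();
        byte[] bom = {(byte) 0xFF, (byte) 0xFE};
        byte[] text = "<pat>a cappella\n".getBytes(StandardCharsets.UTF_16LE);
        byte[] all = new byte[bom.length + text.length];
        System.arraycopy(bom, 0, all, 0, bom.length);
        System.arraycopy(text, 0, all, bom.length, text.length);
        Files.write(file.toPath(), all);
        return file;
    }

    public static void main(String[] args) throws IOException {
        File file = writeSample();
        // Decoded as UTF-16: the BOM is consumed and the entry comes back intact.
        System.out.println(readEntries(file, "UTF-16").get(0));
        // Decoded as ISO-8859-1: the BOM bytes show up as the characters ÿþ.
        System.out.println(
                readEntries(file, "ISO-8859-1").get(0).startsWith("\u00FF\u00FE"));
    }
}
```

With the "UTF-16" charset the decoder uses the BOM to pick the byte order and does not hand it to you as text, so the first entry starts with `<pat>`. If your file turns out to be in some other encoding, pass that charset name instead.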
Upvotes: 3