Reputation: 153963
java.util.Scanner can't handle non-breaking spaces in file content which is bizarre.
Here is the input text; put this in a file called asdf.txt:
lines lines lines
asdf jkl
lines lines lines
Between asdf and jkl is a non-breaking space. Specifically:
echo "asdf jkl" | od -c
0000000 a s d f 302 240 j k l \n
0000012
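The same byte dump can be reproduced from Java, which helps confirm what is actually in the file. This is only a minimal sketch; the names NbspBytes and octalDump are made up for illustration and are not part of the question.

```java
import java.nio.charset.StandardCharsets;

public class NbspBytes {
    // Returns the UTF-8 bytes of s as space-separated octal values,
    // in the same notation od -c uses for non-printable bytes.
    static String octalDump(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toOctalString(b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00A0 is the non-breaking space between "asdf" and "jkl"
        System.out.println(octalDump("asdf\u00A0jkl"));
    }
}
```

The two values 302 240 in the middle of the dump are the UTF-8 encoding of U+00A0, matching the od output above.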
You can copy/paste it in here and see it: http://www.fontspace.com/unicode/analyzer/
The offending character is also known as: 302 240, U+00A0, &nbsp;, &#160;, &#xA0;, %C2%A0
The code:
import java.util.*;
import java.io.*;

public class Main {
    public static void main(String[] args) {
        Scanner r = null;
        try {
            File f = new File("/home2/ericlesc/testfile/asdf.txt");
            r = new Scanner(f);
            while (r.hasNextLine()) {
                String line = r.nextLine();
                System.out.println("line is: " + line);
            }
            System.out.println("done");
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
java.util.Scanner chokes on this content. Surprisingly, it does NOT throw an exception saying "can't process this character". It doesn't stop on the offending line either: the Scanner stops reading roughly 30 characters before the offending character.
Is there documentation on how I can use java.util.Scanner to read in a non-breaking space without choking?
Why can't java.util.Scanner process a non-breaking space? How can I make it process it as normal?
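One way to narrow this down: Scanner does not propagate I/O errors from its source. It treats them as end of input and keeps the last one available through ioException(). The sketch below is a diagnostic, not a fix. It assumes the platform default charset on the asker's machine behaved like US-ASCII (the question does not say), and it writes the sample text to a temporary file instead of using the real path. ScannerDiagnosis and swallowedError are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public class ScannerDiagnosis {
    // Writes the sample text to a temp file as UTF-8, scans it with the given
    // charset, and returns the IOException the Scanner swallowed (null if none).
    static IOException swallowedError(String charsetName) {
        try {
            File f = File.createTempFile("asdf", ".txt");
            f.deleteOnExit();
            String sample = "lines lines lines\nasdf\u00A0jkl\nlines lines lines\n";
            Files.write(f.toPath(), sample.getBytes(StandardCharsets.UTF_8));
            try (Scanner r = new Scanner(f, charsetName)) {
                while (r.hasNextLine()) {
                    System.out.println("line is: " + r.nextLine());
                }
                // Non-null when the underlying reader failed, e.g. on undecodable bytes
                return r.ioException();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("US-ASCII: " + swallowedError("US-ASCII"));
        System.out.println("UTF-8:    " + swallowedError("UTF-8"));
    }
}
```

If the first call reports a character-coding error and the second reports null, the problem is the charset used for decoding, not the non-breaking space itself.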
Upvotes: 1
Views: 1003
Reputation: 153963
With help from powerlord, I was able to use this code to produce the desired output:
import java.util.*;
import java.io.*;

public class Main {
    public static void main(String[] args) {
        Scanner r = null;
        try {
            File f = new File("/home2/ericlesc/testfile/asdf.txt");
            // Pass the charset explicitly instead of relying on the platform default
            r = new Scanner(f, "ISO-8859-1");
            while (r.hasNextLine()) {
                String line = r.nextLine();
                System.out.println("line is: " + line);
            }
            System.out.println("done");
        }
        catch (Exception e) {
            e.printStackTrace();
        }
        finally {
            // Release the file handle held by the Scanner
            if (r != null) {
                r.close();
            }
        }
    }
}
Program prints:
javac Main.java && java Main
line is: lines lines lines
line is: asdf jkl
line is: lines lines lines
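You have to tell Scanner which charset to decode the file with; otherwise it falls back to the platform default, and when that default cannot decode a byte, Scanner stops reading early instead of throwing. ISO-8859-1 works here because it maps every byte to a character. Since the od output shows the UTF-8 bytes 302 240, "UTF-8" is the charset that actually matches how the file was encoded.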
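To see how the choice of charset changes what comes out, the sample bytes can be decoded both ways. This is a small sketch with illustrative names (DecodeCompare, show); characters outside printable ASCII are rendered as <U+XXXX> so they are visible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCompare {
    // The bytes from the od dump: "asdf", 0xC2 0xA0 (octal 302 240), "jkl"
    static final byte[] SAMPLE = "asdf\u00A0jkl".getBytes(StandardCharsets.UTF_8);

    // Decodes SAMPLE with the given charset and makes non-ASCII characters visible.
    static String show(Charset cs) {
        String decoded = new String(SAMPLE, cs);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c > 0x7E) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + show(StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + show(StandardCharsets.ISO_8859_1));
    }
}
```

Decoding as UTF-8 yields a single U+00A0, while ISO-8859-1 yields two characters (U+00C2 followed by U+00A0), so the latter reads the file without error but does not reproduce the original text exactly.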
Upvotes: 0
Reputation: 88796
Unless you tell it otherwise, Scanner assumes the system's default charset. I'm not sure about other OSes, but on Windows this is typically a legacy single-byte code page (windows-1252 on Western installs, which is close to ISO 8859-1) for compatibility reasons.
Luckily, you can tell Scanner what charset you want it to use by using one of the two-argument constructors that take a charset name, such as Scanner(File source, String charsetName).
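As a rough illustration of that constructor, the sketch below prints the default charset the JVM would otherwise use and then reads a file with an explicitly named one. CharsetAwareReader, readLines and sampleFile are hypothetical names, and the temporary file stands in for the asker's asdf.txt on the assumption that it is UTF-8 encoded.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CharsetAwareReader {
    // Reads all lines of a file, decoding with the named charset
    // instead of the platform default.
    static List<String> readLines(File file, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (Scanner r = new Scanner(file, charsetName)) {
            while (r.hasNextLine()) {
                lines.add(r.nextLine());
            }
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Hypothetical helper: a temp copy of the question's sample text, encoded as UTF-8.
    static File sampleFile() {
        try {
            File f = File.createTempFile("asdf", ".txt");
            f.deleteOnExit();
            String sample = "lines lines lines\nasdf\u00A0jkl\nlines lines lines\n";
            Files.write(f.toPath(), sample.getBytes(StandardCharsets.UTF_8));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // What Scanner would use if no charset were given
        System.out.println("default charset: " + Charset.defaultCharset());
        for (String line : readLines(sampleFile(), "UTF-8")) {
            System.out.println("line is: " + line);
        }
    }
}
```

Naming the charset at the call site keeps the result independent of whatever default the machine happens to have.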
Upvotes: 3