Reputation: 2148
I am trying to read a text file with Java. Initially I had hoped to use Files.lines(...).forEach(), but a MalformedInputException forced me to experiment with different styles. I now have the following code, which reads the file with two different techniques. The first try succeeds, but the second fails with a MalformedInputException:
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);
String line;
int lineNo = 1;
while ((line = br.readLine()) != null) {
    System.out.println("Line " + lineNo + ": " + line);
    lineNo++;
}
br.close();
System.out.println("Try again.");
System.out.println();
AtomicInteger lineNoAtomic = new AtomicInteger(1);
try (Stream<String> linesStream = Files.lines(Paths.get(fileName), StandardCharsets.UTF_8)) {
    linesStream.forEach(line -> {
        System.out.println("Line " + lineNoAtomic.get() + ": " + line);
        lineNoAtomic.incrementAndGet();
    });
}
The (truncated) output is as follows:
Line 250: ar 2_%D8%A3%D8%A8%D8%B1%D9%8A%D9%84 1 1
Line 251: ar 2_%D8%A3%D9%83%D8%AA%D9%88%D8%A8%D8%B1 1 1
Line 252: ar 2_%D8%AF%D9%8A%D8%B3%D9%85%D8%A8%D8%B1 1 1
Line 253: ar 2_%D8%B0%D9%88_%D8%A7%D9%84%D8%AD%D8%AC%D8%A9 1 1
Line 254: ar 2_%D8%B3%D8%A8%D8%AA%D9%85%D8%A8%D8%B1 2 2
Line 255: ar 2_%D9%81%D8%A8%D8%B1%D8%A7%D9%8A%D8%B1 1 1
Line 256: ar 2_%D9%86%D9%88%D9%81%D9%85%D8%A8%D8%B1 1 1
Line 257: ar 2_%D9%8A%D9%86%D8%A7%D9%8A%D8%B1 2 2
Line 258: ar 3%d8%af%d9%8a_%d8%b3%d8%aa%d9%88%d8%af%d9%8a%d9%88_%d9%85%d8%a7%d9%83%d8%b3 1 1
Line 259: ar 300_(%D9%81%D9%8A%D9%84%D9%85) 1 1
Try again.
Line 250: ar 2_%D8%A3%D8%A8%D8%B1%D9%8A%D9%84 1 1
Line 251: ar 2_%D8%A3%D9%83%D8%AA%D9%88%D8%A8%D8%B1 1 1
Line 252: ar 2_%D8%AF%D9%8A%D8%B3%D9%85%D8%A8%D8%B1 1 1
Line 253: ar 2_%D8%B0%D9%88_%D8%A7%D9%84%D8%AD%D8%AC%D8%A9 1 1
Line 254: ar 2_%D8%B3%D8%A8%D8%AA%D9%85%D8%A8%D8%B1 2 2
Line 255: ar 2_%D9%81%D8%A8%D8%B1%D8%A7%D9%8A%D8%B1 1 1
Uh oh.
java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)
at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at cl.gdiazc.pagecounts.Main.main(Main.java:57)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at java.io.BufferedReader$1.hasNext(BufferedReader.java:571)
... 3 more
The input file can be found at: https://www.dropbox.com/s/ne1140qiapdwvcs/sampleInput.txt?dl=0. Might anyone have suggestions as to what the difference might be?
(Note: for some context, the original files I'm trying to read are at https://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/)
Upvotes: 2
Views: 1701
Reputation: 31290
I have looked at the data; line 18275 is the first one that conflicts with UTF-8:
ar 31_ÏíÓãÈÑ 1 1
61 72 20 33 31 5f cf ed d3 e3 c8 d1 20 31 20 31
a r 3 1 _ Ï í Ó ã È Ñ 1 1
The letters with the diacritical marks are encoded in ISO-8859-1.
This is unusual, and I think not as planned by the organisation that provided this data. Typically, characters greater than 0x7F (i.e., not in US-ASCII) are encoded as %xy, which means that % itself must be encoded as %25. That pattern can indeed be found repeatedly, probably due to another glitch, as it typically precedes a pair of hex digits xy >= 0x80.
You can read this using any 8-bit encoding that maps every single byte to a character. However, no "meaning" should be attached to any byte or character beyond 0x7F; i.e., the ÏíÓãÈÑ is not meant to represent "ÏíÓãÈÑ". I guess you should discard lines containing any such character (there are others as well). Also, the pattern %25XY should be considered blemished.
(I guess that the data with glitches results from badly encoded HTTP requests.)
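A minimal, self-contained sketch of this approach (the in-memory bytes below stand in for the real file: a clean line in the style of line 250, plus the glitched bytes cf ed d3 e3 c8 d1 from the hex dump above):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ByteTransparentRead {
    public static void main(String[] args) throws IOException {
        // A clean line followed by the glitched line 18275, whose bytes
        // cf ed d3 e3 c8 d1 appear in the hex dump above.
        String clean = "ar 2_%D8%A3%D8%A8%D8%B1%D9%8A%D9%84 1 1\n";
        byte[] glitched = {'a', 'r', ' ', '3', '1', '_',
                (byte) 0xcf, (byte) 0xed, (byte) 0xd3, (byte) 0xe3, (byte) 0xc8, (byte) 0xd1,
                ' ', '1', ' ', '1', '\n'};
        byte[] all = new byte[clean.length() + glitched.length];
        System.arraycopy(clean.getBytes(StandardCharsets.US_ASCII), 0, all, 0, clean.length());
        System.arraycopy(glitched, 0, all, clean.length(), glitched.length);

        // ISO-8859-1 maps every byte to a character, so decoding can never fail.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(all), StandardCharsets.ISO_8859_1))) {
            br.lines()
              .filter(line -> line.chars().allMatch(c -> c < 0x80)) // drop glitched lines
              .forEach(System.out::println);
        }
    }
}
```

This prints only the clean line; the glitched one is filtered out.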
Edit: Recovery?
For the erratic characters (anything beyond 0x7F) I don't think you have any chance. But it should be possible to reconstruct the %25XY by a simple global replacement, s/%25/%/g, applied to the string; then you are left with %XY. I guess you'll have to undo the URL encoding anyway, as what you really have in line 250 is: أبريل.
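The recovery described above could be sketched like this, using the page title from line 250 of the question's output (note that URLDecoder.decode(String, Charset) requires Java 10+, and that URLDecoder additionally turns + into a space, which may or may not be desired here):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UndoEncoding {
    public static void main(String[] args) {
        // Page title from line 250 of the sample output.
        String raw = "2_%D8%A3%D8%A8%D8%B1%D9%8A%D9%84";
        // First undo any accidental double-encoding: %25XY -> %XY.
        String fixed = raw.replace("%25", "%");
        // Then undo the percent-encoding itself; the encoded bytes are UTF-8.
        String decoded = URLDecoder.decode(fixed, StandardCharsets.UTF_8);
        System.out.println(decoded);  // 2_أبريل
    }
}
```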
Upvotes: 2
Reputation: 4695
The first statement in your question, "I am trying to read a text file with Java", makes it imperative for you to know the file's encoding beforehand. There is no way to comprehend a text file without first knowing how it is encoded. This is a philosophical issue, but let's stick to issues that are closer to reality :-). When I downloaded the file you pointed to, the first thing I did was run the Unix command file on it. It shows:
➜ /tmp file ~/Downloads/sampleInput.txt
/Users/kmhaswade/Downloads/sampleInput.txt: ISO-8859 text
Good! So we know that the file is actually encoded using an ISO-8859 encoding, and the only way for us to make sense of it is to decode it as such.
Your question "Might anyone have suggestions as to what the difference might be?" is an intriguing one. The only answer I can give confidently is that the two snippets follow entirely different call hierarchies inside the JDK. It appears to me that the first attempt uses an internal class, sun.nio.cs.StreamDecoder, for its decoding needs, whereas the latter uses the newer java.nio.charset.CharsetDecoder for that purpose. The first one is able to make some sense of the text with any encoding provided. Thus, if I do:
String fileName = "/tmp/sampleInput.txt";
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.US_ASCII);
BufferedReader br = new BufferedReader(isr);
it still succeeds, though it is not clear whether it does the right thing. The second code snippet:
AtomicInteger lineNoAtomic = new AtomicInteger(1);
try (Stream<String> linesStream = Files.lines(Paths.get(fileName), StandardCharsets.US_ASCII)) {
    linesStream.forEach(aLine -> {
        System.out.println("Line " + lineNoAtomic.get() + ": " + aLine);
        lineNoAtomic.incrementAndGet();
    });
}
fails with Caused by: java.nio.charset.MalformedInputException: Input length = 1 for any value of the charset other than StandardCharsets.ISO_8859_1! I believe the behavior of this snippet is the more correct one: it was asked to decode something with the wrong encoding, and it refuses to do so. Also interesting is that the second attempt fails entirely -- not a single line is output!
Perhaps someone from the JDK team should comment on the differences? The fact remains however: We should always read a text file knowing its encoding beforehand.
Upvotes: 0
Reputation: 111219
The input file is not encoded in UTF-8. The first way of reading the file silently substitutes the characters it doesn't understand with a replacement character. If you prefer this behaviour but still want to use streams, you can get a stream of lines from the BufferedReader:
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);
try (Stream<String> linesStream = br.lines()) {
    ...
}
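The substitution behaviour can also be spelled out explicitly by configuring a CharsetDecoder to REPLACE malformed input -- a minimal sketch on a few in-memory bytes rather than the actual file:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ExplicitReplace {
    public static void main(String[] args) throws IOException {
        // 'a', a byte that is malformed as UTF-8 on its own, then 'b'.
        byte[] bytes = {'a', (byte) 0xcf, 'b'};

        // Replace, rather than report, anything the decoder can't decode.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), decoder))) {
            // The bad byte comes through as U+FFFD, the replacement character.
            System.out.println(br.readLine());  // a\uFFFDb
        }
    }
}
```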
Ideally you should try to determine what encoding the files are supposed to use, but StandardCharsets.ISO_8859_1 will always work.
Upvotes: 1