Reputation: 2087
I made a simple parser in Java that reads a file one character at a time and constructs words.
I tried to run it under Linux and noticed that comparing against '\n'
doesn't work, although comparing the character with the value 10
works as expected. According to the ASCII table, value 10 is LF (line feed). I read somewhere (I don't remember where) that Java should be able to find a newline just by looking for '\n'
.
I am using BufferedReader
and its read
method to read characters.
readLine
cannot be used because it would cause other problems.
The problem seems to appear when I use files with Mac/Windows line endings under Linux.
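For reference, a character-by-character loop that treats all three line-ending conventions (Unix '\n', Windows "\r\n", old Mac '\r') as a line break could look like this. This is a minimal sketch, not the asker's actual parser; the class name, helper method, and sample input are illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineEndingDemo {

    // Collects lines, accepting \n, \r\n and \r as line breaks.
    static List<String> readLines(Reader reader) throws IOException {
        List<String> lines = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean lastWasCR = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                if (!lastWasCR) {      // the \r of \r\n already ended the line
                    lines.add(current.toString());
                    current.setLength(0);
                }
                lastWasCR = false;
            } else if (c == '\r') {
                lines.add(current.toString());
                current.setLength(0);
                lastWasCR = true;
            } else {
                current.append((char) c);
                lastWasCR = false;
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixed Unix (\n), Windows (\r\n) and old-Mac (\r) endings
        System.out.println(readLines(new StringReader("one\ntwo\r\nthree\rfour")));
    }
}
```

The `lastWasCR` flag is what prevents a Windows "\r\n" pair from being counted as two line breaks.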
Upvotes: 0
Views: 3187
Reputation: 17595
Here are two ways you can do it:
1. Read line by line and split each line using a regular expression to get the single words.
2. Write your own isDelimiter method and use it to check whether you have reached a split condition or not.
package misctests;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

public class SplitToWords {

    String someWords = "Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing'elitr;sed~diam";
    String delimsRegEx = "[\\s;,\\(\\)!'@~]+";
    String delimsPlain = ";,()!'@~"; // without whitespaces
    String[] expectedWords = {
        "Lorem",
        "ipsum",
        "dolor",
        "sit",
        "amet",
        "consetetur",
        "sadipscing",
        "elitr",
        "sed",
        "diam"
    };

    private static final class StringReader {
        String input = null;
        int pos = 0;
        int len = 0;

        StringReader(String input) {
            this.input = input == null ? "" : input;
            len = this.input.length();
        }

        public boolean hasMoreChars() {
            return pos < len;
        }

        public int read() {
            return hasMoreChars() ? ((int) input.charAt(pos++)) : 0;
        }
    }

    @Test
    public void splitToWords_1() {
        String[] actual = someWords.split(delimsRegEx);
        assertEqualsWords(expectedWords, actual);
    }

    @Test
    public void splitToWords_2() {
        StringReader sr = new StringReader(someWords);
        List<String> words = new ArrayList<String>();
        StringBuilder sb = null;
        int c = 0;
        while (sr.hasMoreChars()) {
            c = sr.read();
            // skip leading delimiters
            while (sr.hasMoreChars() && isDelimiter(c)) {
                c = sr.read();
            }
            sb = new StringBuilder();
            // collect characters until the next delimiter
            while (sr.hasMoreChars() && !isDelimiter(c)) {
                sb.append((char) c);
                c = sr.read();
            }
            if (!isDelimiter(c)) {
                sb.append((char) c);
            }
            words.add(sb.toString());
        }
        String[] actual = new String[words.size()];
        words.toArray(actual);
        assertEqualsWords(expectedWords, actual);
    }

    private boolean isDelimiter(int c) {
        return (Character.isWhitespace(c) ||
                delimsPlain.contains(new String("" + (char) c))); // this part is subject to optimization
    }

    private void assertEqualsWords(String[] expected, String[] actual) {
        assertNotNull(expected);
        assertNotNull(actual);
        assertEquals(expected.length, actual.length);
        for (int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }
}
Upvotes: 1
Reputation: 53687
Use readLine()
to read the text on a line-by-line basis.
Example
try {
    FileInputStream fstream = new FileInputStream("textfile.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    // Read the file line by line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
    // Close the input stream
    br.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
Upvotes: 2
Reputation: 17595
If you read files byte by byte, you have to take care of all three cases: '\n' for Linux, "\r\n" for Windows and '\r' for (old) Mac.
Use the method readLine instead. It takes care of these cases for you and returns the line without any terminators. After reading each line you can tokenize it to get the single words.
Also consider using the system property "line.separator". It always holds the system-dependent line terminator, which makes at least your code (not the produced files) more portable.
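A minimal sketch of the readLine-then-tokenize approach described above; the class name, sample input, and whitespace regex are illustrative (readLine strips '\n', "\r\n" and '\r' uniformly):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadLineSplitDemo {

    // Reads line by line, then splits each line on whitespace to get the words.
    static List<String> words(BufferedReader br) throws IOException {
        List<String> result = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) { // skip empties produced by blank lines
                    result.add(word);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Mixed line endings; readLine handles them all the same way
        BufferedReader br = new BufferedReader(
                new StringReader("Lorem ipsum\r\ndolor sit\ramet"));
        System.out.println(words(br));
    }
}
```

With a real file you would wrap a FileReader (or an InputStreamReader with an explicit charset) in the BufferedReader instead of the StringReader used here for demonstration.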
Upvotes: 1