kechap

Reputation: 2087

Line endings confusion

I made a simple parser in Java that reads a file one character at a time and constructs words.

I tried to run it under Linux and noticed that looking for '\n' doesn't work, although comparing the character with the value 10 works as expected. According to the ASCII table, value 10 is LF (line feed). I read somewhere (I don't remember where) that Java should be able to find a newline just by looking for '\n'.

I am using a BufferedReader and its read method to read characters.

EDIT

readLine cannot be used because it would cause other problems.

It looks like the problem appears when I use files with Mac/Windows line endings under Linux.

Upvotes: 0

Views: 3187

Answers (3)

A4L

Reputation: 17595

Here are two ways you can do it:

1- Read the input line by line and split each line using a regular expression to get the single words.

2- Write your own isDelimiter method and use it to check whether you have reached a split condition or not.

package misctests;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;


public class SplitToWords {

    String someWords = "Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing'elitr;sed~diam";
    String delimsRegEx = "[\\s;,\\(\\)!'@~]+";
    String delimsPlain = ";,()!'@~"; // without whitespaces

    String[] expectedWords = {
        "Lorem",
        "ipsum",
        "dolor",
        "sit",
        "amet",
        "consetetur",
        "sadipscing",
        "elitr",
        "sed",
        "diam"
    };

    private static final class StringReader {
        String input = null;
        int pos = 0;
        int len = 0;
        StringReader(String input) {
            this.input = input == null ? "" : input;
            len = this.input.length();
        }

        public boolean hasMoreChars() {
            return pos < len;
        }

        public int read() {
            return hasMoreChars() ? ((int) input.charAt(pos++)) : 0;
        }
    }

    @Test
    public void splitToWords_1() {
        String[] actual = someWords.split(delimsRegEx);
        assertEqualsWords(expectedWords, actual);
    }

    @Test
    public void splitToWords_2() {
        StringReader sr = new StringReader(someWords);
        List<String> words = new ArrayList<String>();
        StringBuilder sb = null;
        int c = 0;
        while(sr.hasMoreChars()) {
            c = sr.read();
            while(sr.hasMoreChars() && isDelimiter(c)) {
                c = sr.read();
            }
            sb = new StringBuilder();
            while(sr.hasMoreChars() && ! isDelimiter(c)) {
                sb.append((char)c);
                c = sr.read();
            }
            if(! isDelimiter(c)) {
                sb.append((char)c);
            }
            if(sb.length() > 0) { // skip the empty word that trailing delimiters would produce
                words.add(sb.toString());
            }
        }

        String[] actual = new String[words.size()];
        words.toArray(actual);

        assertEqualsWords(expectedWords, actual);
    }

    private boolean isDelimiter(int c) {
        // A delimiter is any whitespace (including '\r' and '\n') or one of the plain delimiters
        return Character.isWhitespace(c) || delimsPlain.indexOf(c) >= 0;
    }

    private void assertEqualsWords(String[] expected, String[] actual) {
        assertNotNull(expected);
        assertNotNull(actual);
        assertEquals(expected.length, actual.length);
        for(int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }
}

Upvotes: 1

Sunil Kumar Sahoo

Reputation: 53687

Use readLine() to read the text line by line.

Example

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadFileByLine {
    public static void main(String[] args) {
        // try-with-resources closes the reader (and the underlying stream) automatically
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("textfile.txt")))) {
            String strLine;
            // Read the file line by line
            while ((strLine = br.readLine()) != null) {
                // Print the content on the console
                System.out.println(strLine);
            }
        } catch (IOException e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}

Upvotes: 2

A4L

Reputation: 17595

If you read files character by character you have to take care of all three cases: '\n' for Linux, "\r\n" for Windows and '\r' for classic Mac OS.
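If readLine really is not an option, the three cases can be handled by hand. The following is only a sketch: the class name LineEndingReader and the method readLines are made up for illustration, and it assumes a "\r\n" pair should count as a single line break.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
class LineEndingReader {

    // Reads one character at a time and treats "\n", "\r\n" and "\r"
    // as exactly one line break each.
    static List<String> readLines(Reader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        boolean previousWasCr = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n' && previousWasCr) {
                // Second half of "\r\n": the line was already emitted on '\r'
                previousWasCr = false;
                continue;
            }
            previousWasCr = (c == '\r');
            if (c == '\r' || c == '\n') {
                lines.add(line.toString());
                line.setLength(0);
            } else {
                line.append((char) c);
            }
        }
        // Emit the last line if the input does not end with a terminator
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Any Reader can be passed in, for example a BufferedReader wrapped around the file, so the same loop works regardless of which platform produced the file.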

Use the method readLine instead. It takes care of these things for you and returns only the line without any terminators. After reading each line you can tokenize it to get the single words.
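A minimal sketch of that approach, assuming whitespace is the only word delimiter. The class name LineTokenizer and the method words are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
class LineTokenizer {

    // Reads the input line by line and splits every line on whitespace.
    static List<String> words(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        // readLine() strips the terminator, whether it is "\n", "\r" or "\r\n"
        while ((line = reader.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                // split() yields an empty token for blank lines or leading whitespace
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }
}
```

If other delimiters are needed, the regular expression passed to split can be extended accordingly.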

Also consider using the system property "line.separator". It always holds the system-dependent line terminator, which makes at least your code (not the produced files) more portable.
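As a small illustration, the property can be read like this. The class name LineSeparatorDemo and the method joinLines are made up for the example, and String.join assumes Java 8 or later. Note that the value only describes the platform the JVM is running on, so it helps when writing text, not when reading a file that was created on another system.

```java
// Hypothetical helper, not part of the JDK.
class LineSeparatorDemo {

    // Joins the given lines with the line terminator of the current platform.
    static String joinLines(String... lines) {
        // Same value as System.lineSeparator() (available since Java 7)
        String separator = System.getProperty("line.separator");
        return String.join(separator, lines);
    }
}
```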

Upvotes: 1
