Reputation: 83
I've got a regex "[\r\n\f]+" to find the number of lines contained in a String. My code is like this:
Pattern pattern = Pattern.compile("[\\r\\n\\f]+");
String[] lines = pattern.split(texts);
In my unit test I've got sample strings like these:
"\t\t\t \r\n \n"
"\r\n"
Splitting the first string yields 2 elements, but splitting the second string yields 0.
I thought the second string contained one line, even though that line is "blank" (suppose I'm editing a file that begins with "\r\n" in a text editor: shouldn't the caret be placed on the second line?). Is my regex incorrect for splitting lines, or am I missing something here?
Edit:
Let me make the question clearer:
Why
// notice the trailing space in the string
"\r\n ".split("\r\n").length == 2 // results in 2 strings {"", " "}. So this block of text has two lines.
but
// notice there's no trailing space in the string
"\r\n".split("\r\n").length == 0 // results in an empty array. Why is "" (the empty string) not in the result, so that this block of text contains 0 lines?
Upvotes: 4
Views: 995
Reputation: 5103
What counts as a line terminator really depends on your environment. Quoting Wikipedia:
LF: Multics, Unix and Unix-like systems (GNU/Linux, OS X, FreeBSD, AIX, Xenix, etc.), BeOS, Amiga, RISC OS and others.
CR: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Mac OS up to version 9 and OS-9
RS: QNX pre-POSIX implementation.
0x9B: Atari 8-bit machines using ATASCII variant of ASCII. (155 in decimal)
LF+CR: Acorn BBC and RISC OS spooled text output.
CR+LF: Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC
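Perhaps you should try a platform-neutral approach:
// Needs java.io.BufferedReader and java.io.StringReader; readLine() may throw IOException.
String test = "\t\t\t \r\n \n";
BufferedReader reader = new BufferedReader(new StringReader(test));
int count = 0;
String line;
// readLine() treats "\n", "\r", and "\r\n" as line terminators and returns null at end of input.
while ((line = reader.readLine()) != null) {
    System.out.println(++count + ":" + line);
}
System.out.println("total lines == " + count);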
Edited to include Alan Moore's note about using .ready()
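To compare this with the two sample strings from the question, here is a minimal sketch that wraps the same readLine() loop in a helper. The class name LineCounter and the method countLines are made up for illustration. Note that readLine() does not treat "\f" as a line terminator, so this is not a drop-in replacement for the "[\r\n\f]+" regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineCounter {
    // Hypothetical helper: counts lines the way BufferedReader.readLine() sees them.
    // "\n", "\r", and "\r\n" each terminate a line; a blank line still counts.
    static int countLines(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            // A StringReader should not fail here, but readLine() declares IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines("\t\t\t \r\n \n")); // prints 2
        System.out.println(countLines("\r\n"));           // prints 1
    }
}
```

With this approach the string "\r\n" counts as one (blank) line, which matches the expectation stated in the question, while an empty string counts as zero lines.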
Upvotes: 0
Reputation: 143154
From the documentation for Pattern.split(CharSequence):
This method works as if by invoking the two-argument split method with the given input sequence and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
Many would agree that this behavior is confusingly inconsistent. You can disable the removal of trailing empty strings by passing a negative limit (all negative values do the same thing):
String[] lines = pattern.split(texts, -1);
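As a rough illustration of the difference, the sketch below runs the question's inputs with the default limit and with a negative limit. The class name SplitLimitDemo and the helper splitKeepingTrailing are invented for this example; the regex is the one from the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Hypothetical helper: splits on runs of CR, LF, or form feed and,
    // because of the negative limit, keeps trailing empty strings.
    static String[] splitKeepingTrailing(String text) {
        return Pattern.compile("[\\r\\n\\f]+").split(text, -1);
    }

    public static void main(String[] args) {
        // Default limit (0): trailing empty strings are dropped, leaving nothing.
        System.out.println(Arrays.toString("\r\n".split("\r\n")));     // prints []
        // Negative limit: both empty strings around the separator are kept.
        System.out.println(Arrays.toString("\r\n".split("\r\n", -1))); // prints [, ]
        // First sample from the question: 3 elements instead of 2.
        System.out.println(splitKeepingTrailing("\t\t\t \r\n \n").length); // prints 3
    }
}
```

Note that with a negative limit the count is the number of separators plus one, so text ending in a line terminator reports one more element than the default split. Whether that trailing empty element should count as a line is a choice the caller has to make.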
Upvotes: 5