ChristianMurschall

Reputation: 1671

Regex split newlines - keep captured line break characters at end of match

I need to split a text into its lines, but I also need to keep the line break characters at the end of each line.

var text = "abc\r\ndef";  // should be two lines and 8 characters
// var text = "abc\rdef"; // should be two lines and 7 characters
// var text = "abc\ndef"; // should be two lines and 7 characters
var lines = Regex.Split(text, @"(?<=[\r\n|\r|\n])");
// I was hoping it would split into two lines:
// "abc\r\n"
// "def"
var countChars = 0;
foreach (var line in lines)
{
    countChars += line.Length;
}

Assert.That(countChars, Is.EqualTo(8));
Assert.That(lines.Length, Is.EqualTo(2));

It did not seem that complicated at first, but I cannot make it work. Does anyone have a hint?

Upvotes: 2

Views: 966

Answers (1)

Wiktor Stribiżew

Reputation: 626728

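The problem is that the lookbehind pattern is tried at every position inside the string, and in your string it succeeds both at the position that is immediately preceded by \r and at the position that is immediately preceded by \n (that is, by \r\n). As a result, "abc\r\ndef" is split into three parts: "abc\r", "\n" and "def". Moreover, [\r\n|\r|\n] is a character class, so it is just the same as [\r|\n]: it matches a single CR, LF or pipe character, never the two-character sequence \r\n.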
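To make the behaviour of the original pattern concrete, here is a minimal sketch that runs it against the sample input and against a string containing a pipe. The class and method names (CharClassDemo, SplitWithCharClass) are made up for illustration; only Regex.Split and the pattern come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Splits with the pattern from the question (the character-class version).
    // Hypothetical helper name, used only for this illustration.
    public static string[] SplitWithCharClass(string text) =>
        Regex.Split(text, @"(?<=[\r\n|\r|\n])");

    public static void Main()
    {
        // The lookbehind succeeds after the CR and again after the LF,
        // so the CRLF pair is torn apart: "abc\r", "\n", "def".
        var parts = SplitWithCharClass("abc\r\ndef");
        Console.WriteLine(parts.Length); // 3

        // Inside the brackets the pipe is a literal character,
        // so the pattern also splits after '|': "a|", "b".
        Console.WriteLine(SplitWithCharClass("a|b").Length); // 2
    }
}
```

Note that the character count in the question's test still adds up to 8 with this pattern; it is the line count assertion that fails.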

If you want to make sure you only match a position immediately preceded by a CRLF, or by a CR that has no LF after it, or by an LF that has no CR before it, you can use:

(?<=\r\n|(?<!\r)\n|\r(?!\n))

See the regex demo. Pattern details:

  • (?<= - a positive lookbehind that matches a location that is immediately preceded by
    • \r\n - a CRLF sequence
    • | - or
    • (?<!\r)\n - an LF not immediately preceded by a CR
    • | - or
    • \r(?!\n) - a CR not immediately followed by an LF
  • ) - end of the lookbehind.

See the C# demo:

using System;
using System.Text.RegularExpressions;

var text = "abc\r\ndef";
foreach (var s in Regex.Split(text, @"(?<=\r\n|(?<!\r)\n|\r(?!\n))"))
    Console.WriteLine("'{0}'", s);

Output:

'abc
'
'def'
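As a rough check against all three inputs from the question, the pattern can be wrapped in a small helper. This is only a sketch: LineSplitter and SplitKeepingLineBreaks are hypothetical names chosen for illustration, not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineSplitter
{
    // Pattern from the answer: split after CRLF, after a lone LF, or after a lone CR.
    private static readonly Regex LineEnd =
        new Regex(@"(?<=\r\n|(?<!\r)\n|\r(?!\n))");

    // Hypothetical helper: returns the lines with their terminators still attached.
    public static string[] SplitKeepingLineBreaks(string text) =>
        LineEnd.Split(text);

    public static void Main()
    {
        foreach (var text in new[] { "abc\r\ndef", "abc\rdef", "abc\ndef" })
        {
            var lines = SplitKeepingLineBreaks(text);
            // The split is zero-width, so the pieces add up to the input length.
            Console.WriteLine("{0} lines, {1} chars",
                lines.Length, lines.Sum(l => l.Length));
        }
        // Prints:
        // 2 lines, 8 chars
        // 2 lines, 7 chars
        // 2 lines, 7 chars
    }
}
```

One thing to keep in mind: when the text itself ends with a line break, the lookbehind also succeeds at the very end of the string, so Regex.Split returns an extra empty string as the last element. Whether to drop it depends on whether you count that as a final empty line.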

Upvotes: 3
