Konstantin Spirin

Reputation: 21271

Best way to split string into lines

How do you split a multi-line string into lines?

I know this way

var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

looks a bit ugly and loses empty lines. Is there a better solution?

Upvotes: 194

Views: 239741

Answers (14)

Loudenvier

Reputation: 8794

Update: I had overlooked the zero-allocation, fastest solution provided by Dennis here https://stackoverflow.com/a/65969222/285678 . If you can work with ReadOnlySpan<char> instead of strings, this method is 4 times faster and performs no memory allocations whatsoever. However, if you need to convert the lines back to strings, it performs the same as the non-span-based solutions. I'm updating this answer with this in mind and with an improved version of Dennis' code.

Late to the party, but I've been using a simple collection of extension methods for just that (now packed into a nice nuget: Loudenvier.Utils), which leverages TextReader.ReadLine():

public static class LinesExtensions
{
    public static IEnumerable<string> GetLines(this string line) => GetLines(new StringReader(line));
    public static IEnumerable<string> GetLines(this Stream stm) => GetLines(new StreamReader(stm));
    public static IEnumerable<string> GetLines(this TextReader reader) {
        // 'using' disposes the reader even if the caller stops enumerating early
        using (reader) {
            string? line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    public static LineEnumerator GetLinesAsSpans(this string text) => new(text.AsSpan());

    // based on Dennis answer: https://stackoverflow.com/a/65969222/285678
    public ref struct LineEnumerator(ReadOnlySpan<char> text)
    {
        private ReadOnlySpan<char> Text { get; set; } = text;
        public ReadOnlySpan<char> Current { get; private set; } = default;
        public readonly LineEnumerator GetEnumerator() => this;

        public bool MoveNext() {
            if (Text.IsEmpty) return false;

            var index = Text.IndexOf('\n'); // \r\n or \n
            if (index != -1) {
                // removes \r\n or \n from resulting line as most ReadLine methods do
                var shift = index > 0 && Text[index - 1] == '\r' ? 1 : 0;
                Current = Text[..(index - shift)];
                Text = Text[(index + 1)..];
                return true;
            } else {
                Current = Text;
                Text = [];
                return true;
            }
        }
    }
}

Using the code is really trivial:

// If you have the text as a string...
var text = "Line 1\r\nLine 2\r\nLine 3";
foreach (var line in text.GetLines()) 
    Console.WriteLine(line);
foreach (var line in text.GetLinesAsSpans()) {
    // need to call ToString because it's a char span!
    // so the performance benefits of using span are nullified
    // if you can avoid it then using spans is wayyy better
    Console.WriteLine(line.ToString());
}
// You can also use streams like
// verbatim string, so \t and \f are not treated as escape sequences
var fileStm = File.OpenRead(@"c:\tests\file.txt");
foreach(var line in fileStm.GetLines())
    Console.WriteLine(line);

Hope this helps someone out there.

Upvotes: 3

Jakub Stodola

Reputation: 11

If you need to be sure about line endings format and performance isn't a problem, use this:

string[] result = input.ReplaceLineEndings().Split(Environment.NewLine);
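
As a rough illustration of the one-liner above, here is a minimal sketch assuming .NET 6 or later (where String.ReplaceLineEndings is available). The sample input and the class name are made up for the example:

```csharp
using System;

class ReplaceLineEndingsDemo
{
    static void Main()
    {
        // Hypothetical input mixing Windows, Unix and old-Mac terminators.
        string input = "one\r\ntwo\nthree\rfour";

        // ReplaceLineEndings() rewrites every terminator to Environment.NewLine,
        // so a single string separator is enough afterwards.
        string[] result = input.ReplaceLineEndings().Split(Environment.NewLine);

        Console.WriteLine(result.Length); // 4
        foreach (string line in result)
            Console.WriteLine(line);
    }
}
```

The normalization step creates an intermediate copy of the whole string, which is why this is only suggested when performance isn't a concern.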

Upvotes: 1

leandromoh

Reputation: 169

Split a string into lines without any allocation.

static IEnumerable<ReadOnlyMemory<char>> GetLines(this string text, string newLine) 
{
    if (text.Length == 0)
        yield break;

    var memory = text.AsMemory();
    int index;

    while ((index = memory.Span.IndexOf(newLine)) != -1) 
    {
        yield return memory.Slice(0, index);
        memory = memory.Slice(index + newLine.Length);
    }

    yield return memory;
}

Example of use

foreach (ReadOnlyMemory<char> line in GetLines(text, "\r\n"))
{
   // use the line variable or if needed...
   // alternative use

   ReadOnlySpan<char> span = line.Span;
   string str = line.Span.ToString();
}

Upvotes: 1

Denis535

Reputation: 3590

Split a string into lines without any allocation.

public static LineEnumerator GetLines(this string text) {
    return new LineEnumerator( text.AsSpan() );
}

internal ref struct LineEnumerator {

    private ReadOnlySpan<char> Text { get; set; }
    public ReadOnlySpan<char> Current { get; private set; }

    public LineEnumerator(ReadOnlySpan<char> text) {
        Text = text;
        Current = default;
    }

    public LineEnumerator GetEnumerator() {
        return this;
    }

    public bool MoveNext() {
        if (Text.IsEmpty) return false;

        var index = Text.IndexOf( '\n' ); // \r\n or \n
        if (index != -1) {
            // note: Current keeps its trailing \n (and any preceding \r)
            Current = Text.Slice( 0, index + 1 );
            Text = Text.Slice( index + 1 );
            return true;
        } else {
            Current = Text;
            Text = ReadOnlySpan<char>.Empty;
            return true;
        }
    }


}
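
A possible way to consume such an enumerator is sketched below. It repeats a compact copy of the enumerator so that it compiles on its own; the names SpanLineDemo and CountNonBlankLines are hypothetical and only serve the illustration. Because each line still carries its terminator, the sketch trims it with span operations, so no string is allocated per line:

```csharp
using System;

public static class SpanLineDemo
{
    // Compact copy of the enumerator above so the sketch is self-contained.
    public ref struct LineEnumerator
    {
        private ReadOnlySpan<char> _text;
        public ReadOnlySpan<char> Current { get; private set; }

        public LineEnumerator(ReadOnlySpan<char> text) { _text = text; Current = default; }
        public LineEnumerator GetEnumerator() => this;

        public bool MoveNext()
        {
            if (_text.IsEmpty) return false;
            int index = _text.IndexOf('\n');
            if (index != -1)
            {
                Current = _text.Slice(0, index + 1); // terminator is kept
                _text = _text.Slice(index + 1);
            }
            else
            {
                Current = _text;
                _text = ReadOnlySpan<char>.Empty;
            }
            return true;
        }
    }

    // Hypothetical helper: counts lines that contain something other than whitespace.
    public static int CountNonBlankLines(string text)
    {
        int count = 0;
        foreach (ReadOnlySpan<char> line in new LineEnumerator(text.AsSpan()))
        {
            // Strip a trailing "\n" and then a trailing "\r" without allocating.
            ReadOnlySpan<char> content = line.TrimEnd('\n').TrimEnd('\r');
            if (!content.IsWhiteSpace()) count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountNonBlankLines("a\r\n\r\nb\nc")); // 3
    }
}
```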

Upvotes: 6

MAG TOR

Reputation: 129

string[] lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

Upvotes: 7

Glenn Slayden

Reputation: 18749

It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF \u000D\u000A, so this combination should only emit a single line. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:

  • each and every CR or LF character should skip to the next line EXCEPT...
  • ...if a CR is immediately followed by LF (\u000D\u000A) then these two together skip just one line.
  • String.Empty is the only input that returns no lines (any character entails at least one line)
  • The last line must be returned even if it has neither CR nor LF.

The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line-breaking function that implements these guidelines to correctly handle any arbitrary sequence or combination of CR/LF. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.

/// <summary>
/// Enumerates the text lines from the string.
///   ⁃ Mixed CR-LF scenarios are handled correctly
///   ⁃ String.Empty is returned for each empty line
///   ⁃ No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
    int j = 0, c, i;
    char ch;
    if ((c = s.Length) > 0)
        do
        {
            for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                ;

            yield return s.Substring(i, j - i);
        }
        while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}

Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.

public static IEnumerable<String> Lines(this String s)
{
    using (var tr = new StringReader(s))
        while (tr.ReadLine() is String L)
            yield return L;
}
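
To see the rules in action, the following sketch runs both variants on one mixed input and checks that they agree. The method names LinesManual and LinesViaReader, the class name, and the sample string are chosen for this illustration only; the loop body is the one from the first function above:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LineRulesCheck
{
    // Hand-written scanner, same logic as the first function above.
    public static IEnumerable<string> LinesManual(string s)
    {
        int j = 0, c, i;
        char ch;
        if ((c = s.Length) > 0)
            do
            {
                for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                    ;
                yield return s.Substring(i, j - i);
            }
            while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
    }

    // StringReader-based variant, same logic as the second function above.
    public static IEnumerable<string> LinesViaReader(string s)
    {
        using (var tr = new StringReader(s))
            while (tr.ReadLine() is string line)
                yield return line;
    }

    static void Main()
    {
        // CR-LF, lone LF, lone CR, then CR followed by CR-LF.
        string input = "a\r\nb\n\rc\r\r\nd";

        List<string> manual = LinesManual(input).ToList();
        List<string> viaReader = LinesViaReader(input).ToList();

        Console.WriteLine(manual.Count);                    // 6: a, b, "", c, "", d
        Console.WriteLine(manual.SequenceEqual(viaReader)); // True
    }
}
```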

Upvotes: 2

orad

Reputation: 16056

Update: See here for an alternative/async solution.


This works great and is faster than Regex:

input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")

Except that Regex turns out to be roughly 8 times slower. Here's my test:

Action<Action> measure = (Action func) => {
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++) {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);

Output:

00:00:03.8527616

00:00:31.8017726

00:00:32.5557128
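
If you want to repeat the measurement, a variant based on System.Diagnostics.Stopwatch may give steadier numbers than DateTime.Now, whose resolution is fairly coarse. This is only a sketch: the helper name Measure, the warm-up call, and the lower iteration count are assumptions, not part of the original test, and no timings are claimed for it:

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

class SplitTiming
{
    // Hypothetical helper: runs func repeatedly and returns the elapsed time.
    static TimeSpan Measure(Action func, int iterations = 10000)
    {
        func(); // warm-up call so JIT compilation is not part of the timing
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            func();
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Same test pattern as above, built with a StringBuilder.
        var sb = new StringBuilder();
        for (int i = 0; i < 100; i++)
            sb.Append("1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n");
        string input = sb.ToString();

        Console.WriteLine(Measure(() =>
            input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)));
        Console.WriteLine(Measure(() =>
            Regex.Split(input, "\r\n|\r|\n")));
    }
}
```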

and here's the Extension Method:

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines

Upvotes: 82

orad

Reputation: 16056

I had this other answer, but this one, based on Jack's answer, might be preferred since it yields lines lazily as they are enumerated, although it is slightly slower overall.

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
                {
                    continue;
                }
                yield return line;
            }
        }
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines

Test:

Action<Action> measure = (Action func) =>
{
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++)
    {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
);

measure(() =>
    input.GetLines()
);

measure(() =>
    input.GetLines().ToList()
);

Output:

00:00:03.9603894

00:00:00.0029996

00:00:04.8221971
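
The very small second timing is most likely due to deferred execution: calling an iterator method only creates the enumerator, and no line is read until something enumerates it. A minimal sketch of that effect follows; the linesRead counter and the class name are hypothetical additions for the demonstration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class DeferredDemo
{
    // Hypothetical counter, used only to observe when lines are actually read.
    public static int linesRead = 0;

    public static IEnumerable<string> GetLines(string str)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                linesRead++;
                yield return line;
            }
        }
    }

    static void Main()
    {
        IEnumerable<string> lazy = GetLines("a\nb\nc");
        Console.WriteLine(linesRead); // 0: nothing has been read yet

        List<string> list = lazy.ToList();
        Console.WriteLine(linesRead); // 3: ToList() forced the enumeration
    }
}
```

So only the third measurement, which calls ToList(), is comparable to the Split timing.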

Upvotes: 5

Konrad Rudolph

Reputation: 545588

  • If it looks ugly, just remove the unnecessary ToCharArray call.

  • If you want to split by either \n or \r, you've got two options:

    • Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:

      var result = text.Split(new [] { '\r', '\n' });
      
    • Use a regular expression, as indicated by Bart:

      var result = Regex.Split(text, "\r\n|\r|\n");
      
  • If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
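
The difference between the options above can be checked with a short sketch; the sample string and class name are made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

class SplitComparison
{
    static void Main()
    {
        string text = "a\r\nb"; // one Windows-style line break

        // Splitting on single characters treats \r and \n as two separators.
        string[] byChars = text.Split(new[] { '\r', '\n' });
        Console.WriteLine(byChars.Length); // 3: "a", "", "b"

        // The regular expression matches \r\n as one separator.
        string[] byRegex = Regex.Split(text, "\r\n|\r|\n");
        Console.WriteLine(byRegex.Length); // 2: "a", "b"

        // String separators with "\r\n" listed first behave the same way.
        string[] byStrings = text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
        Console.WriteLine(byStrings.Length); // 2: "a", "b"
    }
}
```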

Upvotes: 231

Bart Kiers

Reputation: 170158

You could use Regex.Split:

string[] tokens = Regex.Split(input, @"\r?\n|\r");

Edit: added |\r to account for (older) Mac line terminators.

Upvotes: 37

Jack

Reputation: 4904

using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}

Upvotes: 171

John Thompson

Reputation: 386

    private string[] GetLines(string text)
    {

        List<string> lines = new List<string>();
        using (MemoryStream ms = new MemoryStream())
        {
            StreamWriter sw = new StreamWriter(ms);
            sw.Write(text);
            sw.Flush();

            ms.Position = 0;

            string line;

            using (StreamReader sr = new StreamReader(ms))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
            sw.Close();
        }



        return lines.ToArray();
    }

Upvotes: 2

JDunkerley

Reputation: 12495

Slightly twisted, but an iterator block to do it:

public static IEnumerable<string> Lines(this string Text)
{
    string newLine = Environment.NewLine;
    int start = 0; // index where the current line begins
    int nIndex;
    while ((nIndex = Text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
    {
        yield return Text.Substring(start, nIndex - start);
        start = nIndex + newLine.Length; // skip the whole terminator, not just one char
    }
    yield return Text.Substring(start); // remainder after the last terminator
}

You can then call:

var result = input.Lines().ToArray();

Upvotes: 2

Jonas Elfström

Reputation: 31428

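If you want to keep empty lines, just remove the StringSplitOptions. Note that on Windows, where Environment.NewLine is \r\n, this splits on \r and \n separately and therefore also yields an extra empty entry for every line break.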

var result = input.Split(System.Environment.NewLine.ToCharArray());

Upvotes: 11
