re3el
re3el

Reputation: 785

Efficient way to extract just one element from a delimited string

NOTE: I did ask the same question here but since some people have marked it as duplicate though it had some crafty, neat solutions, I had to create this extra(dupe) question to make it easier for others who are facing similar doubts. Added the question based on the suggestion of fellow stack overflow members.

What is the efficient way to parse through a large delimited string so that I can access just one element from the delimited set without having to store the other substrings involved?

I specifically am not interested in storing the rest of the element values as done when using Split() method since all of this information is irrelevant to the problem at hand. Also, I want to save memory in doing the same.

Problem Statement:
Given the exact delimited position, I need to extract the element contained in that given position in the most efficient way in terms of memory consumed and time taken.

Simple example string: "1,2,3,4,....,21,22,23,24"
Delimter: ,
Delimited Position: 22
Answer expected: 23

Another example string: "61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;""0x8D15A2913C934DE"";Thursday, 19-Jun-14 22:58:10 GMT;"
Delimiter: ;
Delimited Position: 7
Expected Answer: 23

Upvotes: 2

Views: 738

Answers (5)

John Wu
John Wu

Reputation: 52240

If you want to be sure the code parses the string in only one pass, and only parses what is needed, you can write the routine that iterates over the string yourself.

Since all c# strings implement IEnumerable<char> it is fairly straightforward to devise a method that requires zero string allocations:

static public IEnumerable<char> GetDelimitedField(this IEnumerable<char> source, char delimiter, int index)
{
    foreach (var c in source)
    {
        if (c == delimiter) 
        {
            if (--index < 0) yield break;
        }
        else
        {
            if (index == 0) yield return c;
        }
    }
}

This returns the result as an IEnumerable<char> but it's cheap to convert to a string. It's going to be a much shorter string at this point anyway.

static public string GetDelimitedString(this string source, char delimiter, int index)
{
    var result = source.GetDelimitedField(delimiter, index);
    return new string(result.ToArray());
}

And you can call it like this:

var input ="Zero,One,Two,Three,Four,Five,Six";
var output = input.GetDelimitedString(',',5);
Console.WriteLine(output);

Output:

Five

Example on DotNetFiddle

Upvotes: 1

Clay Ver Valen
Clay Ver Valen

Reputation: 1073

Too late for "answer" but this code gives me a run time of about 0.75 seconds with both strings processed 1,000,000 times. Difference this time is that now I'm not Marshaling an object but using pointers.

And this time I am returning a single new string (String.Substring).

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        string testString1 = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24";
        string testString2 = "61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;\"0x8D15A2913C934DE\";Thursday, 19-Jun-14 22:58:10 GMT;";

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 1; i < 1000000; i++)
        {
            Delimit(testString1, ',', 22);
            Delimit(testString2, ';', 6);
        }
        sw.Stop();
        Console.WriteLine($"==>{sw.ElapsedMilliseconds}");
        Console.ReadLine();
    }

    static string Delimit(string stringUnderTest, char delimiter, int skipCount)
    {
        const int SIZE_OF_UNICHAR = 2;

        int i = 0;
        int index = 0;
        char c = Char.MinValue;

        GCHandle handle = GCHandle.Alloc(stringUnderTest, GCHandleType.Pinned);
        try
        {
            IntPtr ptr = handle.AddrOfPinnedObject();
            for (i = 0; i < skipCount; i++)
                while ((char)Marshal.ReadByte(ptr, index += SIZE_OF_UNICHAR) != delimiter) ;
            i = index;
            while ((c = (char)Marshal.ReadByte(ptr, i += SIZE_OF_UNICHAR)) != delimiter) ;
        }
        finally
        {
            if (handle.IsAllocated)
                handle.Free();
        }

        return stringUnderTest.Substring((index + SIZE_OF_UNICHAR) >> 1, (i - index - SIZE_OF_UNICHAR) >> 1);
    }
}

Upvotes: -1

lagripe
lagripe

Reputation: 764

By using the following Regex : ^([^;]*;){21}(.*?); , with that you don't have to generate the hole split list to search for your desired position, and once you reach it, it gonna be a matter of whether exists or not.

Explanation :

^ --> start of a line.

([^;]*;){Position - 1} --> notice that the symbol ; here is the delimiter, the expression will loop Pos - 1 times

(.*?) --> Non-Greedy .*

DEMO

For more about regular expressions on C# : documentation

In the example below i did implemant the two samples to show you how it works.

Match Method : documentation (Basically it searchs only for the first occurence of the pattern) RegexOptions.Singleline : Treats the input as a signle line.

C# Code

Console.WriteLine("First Delimiter : ");
        int Position = 22;
        char delimiter = ',';
        string pattern = @"^([^" + delimiter + "]*" + delimiter + "){" + (Position - 1) + @"}(.*?)" + delimiter;
        Regex regex = new Regex(pattern, RegexOptions.Singleline);
        // First Example
        string Data = @"AAV,zzz,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22ABC,23,24,24";
        Match Re = regex.Match(Data);
        if (Re.Groups.Count > 0)
            Console.WriteLine("\tMatch found : " + Re.Groups[2]);


        // Second Example
        Console.WriteLine("Second Delimiter : ");
        Position = 8;
        delimiter = ';';
        pattern = @"^([^" + delimiter + "]*" + delimiter + "){" + (Position - 1) + @"}(.*?)" + delimiter;
        Data = @"61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;""0x8D15A2913C934DE"";Thursday, 19-Jun-14 22:58:10 GMT;";
        regex = new Regex(pattern, RegexOptions.Singleline);
        Re = regex.Match(Data);
        if (Re.Groups.Count > 0)
            Console.WriteLine("\tMatch found : " + Re.Groups[2]);

Output :

First Delimiter :

    Match found : 22ABC

Second Delimiter :

    Match found : 23

Upvotes: 1

Derviş Kayımbaşıoğlu
Derviş Kayımbaşıoğlu

Reputation: 30565

try this:

public static string MyExtension(this string s, char delimiter, int n)
{
    var begin = n== 0 ? 0 : Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n);
    if (begin == -1)
        return null;
    var end = s.IndexOf(delimiter, begin +  (n==0?0:1));
    if (end == -1 ) end = s.Length;
    //var end = Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n + 1);
    var result = s.Substring(begin +1, end - begin -1 );

    return result;
}

PS: Library used is Westwind.Utilities


Benchmark Code:

void Main()
{

     string s = string.Join(";", Enumerable.Range(65, 26).Select(c => (char)c));
            s = s.Insert(3, ";;;");

            string o = "";

            Stopwatch sw = new Stopwatch();

            sw.Start();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';', 21);
            }
            sw.Stop();
            Console.WriteLine("Item directly selected: " + sw.ElapsedMilliseconds);


            sw.Restart();
            for (int i = 1; i <= 1000000; i++) {
                o = s.MyExtension(';', 21);
            }
            sw.Stop();
            Console.WriteLine("Item directly selected by MyExtension: " + sw.ElapsedMilliseconds);

            sw.Restart();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';')[21];
            }
            sw.Stop();
            Console.WriteLine("Item from split array:  " + sw.ElapsedMilliseconds + "\r\n");


            Console.WriteLine(s);
            Console.WriteLine(o);

}

public static class MyExtensions
{
    /// <summary>
    /// Get the nth item from a delimited string.
    /// </summary>
    /// <param name="s">The string to retrieve a delimited item from.</param>
    /// <param name="delimiter">The character used as the item delimiter.</param>
    /// <param name="n">Zero-based index of item to return.</param>
    /// <returns>The nth item or an empty string.</returns>
    public static string Split(this string s, char delimiter, int n)
    {

        int pos = pos = s.IndexOf(delimiter);

        if (n == 0 || pos < 0)
        { return (pos >= 0) ? s.Substring(0, pos) : s; }

        int nDelims = 1;

        while (nDelims < n && pos >= 0)
        {
            pos = s.IndexOf(delimiter, pos + 1);
            nDelims++;
        }

        string result = "";

        if (pos >= 0)
        {
            int nextDelim = s.IndexOf(delimiter, pos + 1);
            result = (nextDelim < 0) ? s.Substring(pos + 1) : s.Substring(pos + 1, nextDelim - pos - 1);
        }

        return result;
    }

    public static string MyExtension(this string s, char delimiter, int n)
    {
        var begin = n== 0 ? 0 : Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n);
        if (begin == -1)
            return null;
        var end = s.IndexOf(delimiter, begin +  (n==0?0:1));
        if (end == -1 ) end = s.Length;
        //var end = Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n + 1);
        var result = s.Substring(begin +1, end - begin -1 );

        return result;
    }

}

Results:

Item directly selected: 277
Item directly selected by MyExtension: 114
Item from split array:  1297

A;B;;;;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
S

Edit: Thanks to @Kalten, I enhanced solution further. Considerable difference has been seen on benchmark results.

Upvotes: 2

Andrew Morton
Andrew Morton

Reputation: 25013

There are some useful remarks relevant to this problem in the documentation for String.Split, although I wrote the following before discovering that.

One way to do it is to find a delimiter with String.IndexOf method - you can specify the index to start the search from, so it is possible to skip along the items without having to examine every character. (The examination of every character happens behind the scenes, but it's a little bit faster than doing it yourself.)

I made up an extension method by adding a new class named "ExtensionMethods.cs" to the solution with this content:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        /// <summary>
        /// Get the nth item from a delimited string.
        /// </summary>
        /// <param name="s">The string to retrieve a delimited item from.</param>
        /// <param name="delimiter">The character used as the item delimiter.</param>
        /// <param name="n">Zero-based index of item to return.</param>
        /// <returns>The nth item or an empty string.</returns>
        public static string Split(this string s, char delimiter, int n)
        {

            int pos = pos = s.IndexOf(delimiter);

            if (n == 0 || pos < 0)
            { return (pos >= 0) ? s.Substring(0, pos) : s; }

            int nDelims = 1;

            while (nDelims < n && pos >= 0)
            {
                pos = s.IndexOf(delimiter, pos + 1);
                nDelims++;
            }

            string result = "";

            if (pos >= 0)
            {
                int nextDelim = s.IndexOf(delimiter, pos + 1);
                result = (nextDelim < 0) ? s.Substring(pos + 1) : s.Substring(pos + 1, nextDelim - pos - 1);
            }

            return result;
        }

    }
}

And a small program to test it:

using System;
using System.Diagnostics;
using System.Linq;
using ExtensionMethods;

namespace ConsoleApp1
{

    class Program
    {

        static void Main(string[] args)
        {
            // test data...
            string s = string.Join(";", Enumerable.Range(65, 26).Select(c => (char)c));
            s = s.Insert(3, ";;;");

            string o = "";

            Stopwatch sw = new Stopwatch();

            sw.Start();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';', 21);
            }
            sw.Stop();
            Console.WriteLine("Item directly selected: " + sw.ElapsedMilliseconds);

            sw.Restart();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';')[21];
            }
            sw.Stop();
            Console.WriteLine("Item from split array:  " + sw.ElapsedMilliseconds + "\r\n");


            Console.WriteLine(s);
            Console.WriteLine(o);

            Console.ReadLine();

        }
    }
}

Sample output:

Item directly selected: 1016
Item from split array: 1345

A;B;;;;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
S


Reference: How to: Implement and Call a Custom Extension Method (C# Programming Guide)

Upvotes: 2

Related Questions