sorh

Reputation: 125

Split string at first occurrence of a letter without regex in C#

I have lots of strings that all start like CIDR IP addresses (e.g. 192.168.1.104), but some of them have random letters at the end (e.g. 192.168.1.104kadjwneqb). Is there a way to split these strings at the first occurrence of a letter without using regex? Regexes are too expensive to compute because I need to process a lot of these. Thank you in advance.

Upvotes: 0

Views: 70

Answers (3)

Icemanind

Reputation: 48686

Probably the easiest way to do this using no RegEx or loops is something like this:

using System;

public class Program
{
    public static void Main()
    {
        string inputIp="192.168.12.127fjieif34f";

        int firstNumber = inputIp.IndexOfAny("0123456789".ToCharArray());
        int firstAlpha = inputIp.IndexOfAny("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".ToCharArray(), firstNumber);

        string ip = inputIp.Substring(firstNumber, firstAlpha - firstNumber);

        Console.WriteLine("The IP is " + ip);
    }
}

This is how this works:

  1. It finds the index of the first digit (since IP addresses start with a digit).
  2. It then finds the index of the first alphabetic character at or after that position.
  3. It uses Substring to extract the IP nestled between firstNumber and firstAlpha.

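This simple example doesn't do any kind of checking, which you will probably want to add. IndexOfAny returns -1 when none of the characters are found, and in that case the code above throws an ArgumentOutOfRangeException (for example, when the string has no trailing letters).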
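As a rough sketch of what that checking could look like: the helper below (the names `IpPrefix` and `BeforeFirstLetter` are made up for illustration and are not part of the code above) handles the -1 case by returning the whole string. It assumes the input starts with the IP, as described in the question, so it skips the search for the first digit.

```csharp
using System;

public static class IpPrefix
{
    // Hypothetical helper for illustration only.
    // Allocate the lookup array once instead of on every call.
    private static readonly char[] Letters =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".ToCharArray();

    // Returns the text before the first ASCII letter,
    // or the whole string when it contains no letter.
    public static string BeforeFirstLetter(string input)
    {
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        int firstAlpha = input.IndexOfAny(Letters);

        // IndexOfAny returns -1 when none of the characters occur.
        return firstAlpha < 0 ? input : input.Substring(0, firstAlpha);
    }
}
```

With this, `IpPrefix.BeforeFirstLetter("192.168.12.127fjieif34f")` gives `192.168.12.127`, and a clean input such as `192.168.1.104` comes back unchanged instead of throwing.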

Upvotes: 1

Matthew Watson

Reputation: 109567

According to my testing, a custom loop is around five times faster than the regex you're using (although there are likely regexes that could be a bit faster).

I tested using BenchmarkDotNet, the result being:

|                                Method |      Mean |     Error |    StdDev |
|-------------------------------------- |----------:|----------:|----------:|
| BenchTruncateAtLastDigitViaCustomCode |  4.846 ms | 0.0531 ms | 0.0652 ms |
|      BenchTruncateAtLastDigitViaRegex | 21.886 ms | 0.2421 ms | 0.2265 ms |

And the test code:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

namespace CoreConsoleA
{
    public class UnderTest
    {
        public UnderTest()
        {
            var rng = new Random(12456); // Seeded RNG, so same data every run.

            // Test with 100,000 IP addresses of which 10% have between 1 and 20 extra characters.

            var data = new byte[4]; 
            int n = 100_000;

            for (int i = 0; i < n; ++i)
            {
                rng.NextBytes(data); // Fill with four random bytes for a random IPv4 address.
                IPAddress ipAddress = new IPAddress(data);

                if (rng.NextDouble() < 0.1)
                {
                    int extra = rng.Next(1, 21);
                    _strings.Add(ipAddress + new string('x', extra));
                }
                else
                {
                    _strings.Add(ipAddress.ToString());
                }
            }
        }

        [Benchmark]
        public void BenchTruncateAtLastDigitViaCustomCode()
        {
            foreach (var s in _strings)
            {
                TruncateAtLastDigitViaCustomCode(s);
            }
        }

        [Benchmark]
        public void BenchTruncateAtLastDigitViaRegex()
        {
            foreach (var s in _strings)
            {
                TruncateAtLastDigitViaRegex(s);
            }
        }

        public string TruncateAtLastDigitViaCustomCode(string s)
        {
            return s.Substring(0, IndexOfLastDigitViaCustomCode(s));
        }

        public string TruncateAtLastDigitViaRegex(string s)
        {
            return s.Substring(0, IndexOfLastDigitViaRegex(s));
        }

        public static int IndexOfLastDigitViaCustomCode(string s)
        {
            for (int i = 0; i < s.Length; ++i)
            {
                char c = s[i];

                if (!char.IsDigit(c) && c != '.')
                    return i;
            }

            return s.Length;
        }

        public int IndexOfLastDigitViaRegex(string s)
        {
            Match match = _ipTruncate.Match(s);

            // A failed match reports Index 0, so check Success rather than the index.
            return match.Success ? match.Index : s.Length;
        }

        readonly Regex _ipTruncate = new Regex("[^0-9.]", RegexOptions.Compiled);

        readonly List<string> _strings = new List<string>();
    }
}

Upvotes: 2

jeroenh

Reputation: 26782

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Several comments claim that regex should be fine, so I tested it. Compiled in release mode, on my machine the regex approach is 10x slower. That may still be OK, depending on the context of course. But IMO the naive implementation (which just looks for the first character that is neither a digit nor a dot, then returns the substring before it) is also way simpler to understand. YMMV.

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class Program
{
    static void Main(string[] args)
    {
        var input = "192.168.1.104kadjwneqb";
        Timed(() => GetIp1(input));
        Timed(() => GetIp2(input));
    }
    static Regex regex = new Regex(@"^(?<cidr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})[a-zA-Z]*$", RegexOptions.Compiled);

    static void Timed(Action a)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000_000; i++)
            a();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    static string GetIp1(string input)
    {
        int i = 0;
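        // Also stop at the end of the string, so inputs without trailing letters don't throw.
        while (i < input.Length && (char.IsDigit(input[i]) || input[i] == '.')) i++;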
        return input.Substring(0, i);
    }
    static string GetIp2(string input)
    {
        var m = regex.Match(input);
        return m.Groups["cidr"].Value;
    }
}

Upvotes: 1
