jigga

Reputation: 614

Fastest way to check if the substrings of one string are in another string

Let's say we have two strings:

    str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500,508,523,543L,580,584,772,802,P55,VB" ;

    str2 = "772+240+802+P55+263B+1U2+VB";

and want to know whether every substring of str2, split on "+", is in str1.

Is there something "better" than doing this

    string[] str1Arr = str1.Split(',');
    foreach (var s in str2.Split('+'))
    {
        if (!str1Arr.Contains(s))
        {
             return false;
        } 
    }
    return true;

str2 would return true, while

   str3 ="772+240+802+P55+263B+1U2+V";

would return false.

Edit:

Max length of str1 is ~1000, with all delimited strings being unique and sorted. Max length of str2 is ~700 (average 200), with substrings being unique but not sorted.

Here are some more strings:

"200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379"//false
"612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV"//false
"255B+2U1+775U+336B+P55+802"//true
"220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830"//false
"255B"//true

Upvotes: 1

Views: 434

Answers (2)

Chris Schaller

Reputation: 16574

tldr;

Do not split str1 into an array; just use String.Contains() to find the match.

string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains("," + s + ","))
         return false;
}
return true;

For a small number of elements, string Contains is going to be more efficient, especially in the case of positive matches that are close to the start of the string.

This was my very first thought before I provided my previous answer, but it was a gut feeling not supported by any empirical evidence. I also deleted that original answer because it didn't take exact matches into account (I didn't inject the commas into the search args).


Update: .Net 6 - String Interpolation

It turns out that using string interpolation can save a few memory allocations, so it will enhance this solution even further:

string str1Fixed = $",{str1},";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains($",{s},"))
         return false;
}
return true;

The difference is marginal, but if you are executing a significant number of iterations, it's all going to add up.
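
Along the same lines, a possible further refinement on .Net 6 (an assumption on my part; this variant is not part of the benchmarks below) is to avoid the str2.Split allocation entirely by walking the string as a span:

// Sketch only: walk the required values as a span instead of allocating an
// array with Split; Contains(string, StringComparison) is available on .Net 6.
static bool ContainsAllSpans(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = $",{optionalValuesCSV},";
    ReadOnlySpan<char> remaining = requiredValuesPSV.AsSpan();
    while (true)
    {
        int plus = remaining.IndexOf('+');
        ReadOnlySpan<char> term = plus < 0 ? remaining : remaining.Slice(0, plus);
        if (!str1Fixed.Contains(string.Concat(",", term, ","), StringComparison.Ordinal))
            return false;
        if (plus < 0)
            return true;
        remaining = remaining.Slice(plus + 1);
    }
}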


But don't take my word for it, let us prove it!

Firstly, I will introduce you to our test cases, all of which have been tested to ensure compliance:

public bool Arrays(string optionalValuesCSV, string requiredValuesPSV)
{
    string[] str1Arr = optionalValuesCSV.Split(',');
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Arr.Contains(s))
            return false;
    }
    return true;
}

public bool ConcatenatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = "," + optionalValuesCSV + ",";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains("," + s + ","))
             return false;
    }
    return true;
}

public bool FormattedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = String.Format(",{0},", optionalValuesCSV);
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains(String.Format(",{0},", s)))
             return false;
    }
    return true;
}

public bool InterpolatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = $",{optionalValuesCSV},";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains($",{s},"))
             return false;
    }
    return true;
}

public bool LinqExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(optionalValuesCSV.Split(','))
        .Any();
}

public bool LinqHashsetExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(new HashSet<string>(optionalValuesCSV.Split(',')))
        .Any();
}
            
public bool RegexMatchIterator(string optionalValuesCSV, string requiredValuesPSV)
{
    Regex searchPattern = null;
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        searchPattern = new Regex("," + s + ",");
        if (!searchPattern.IsMatch("," + optionalValuesCSV + ","))
             return false;
    }
    return true;
}       

public bool RegexMatchAllPattern(string optionalValuesCSV, string requiredValuesPSV)
{
    // note: need to escape \b for word boundary
    var pattern = "(?=.*?\\b" + requiredValuesPSV.Replace("+","\\b)(?=.*?\\b") + "\\b)";
    Regex searchPattern = new Regex(pattern);
    //return searchPattern.IsMatch("," + optionalValuesCSV + ",");
    return searchPattern.IsMatch(optionalValuesCSV);
}   

Most of this is from the thought process and comments to my previous answer:

  • Arrays
    The OP's logic, to test against

  • ConcatenatedString
    Use String.Contains() on the str1 value instead of splitting it into an array of strings.

  • FormattedString
    Same logic as ConcatenatedString but uses String.Format instead of concatenating values.

    • I know from previous experience and analysis that concatenation is more efficient for very small numbers of templated arguments, but it is included here as a proof so you don't have to take my word for it.
  • InterpolatedString
    Same logic as ConcatenatedString but uses String Interpolation instead of concatenating values.

    • I've never tested the performance of this, I hope it is faster than String.Format.
    • I can't test this on dotNetFiddle for .Net 4.7 because the compiler doesn't support it, and we can't test with .Net 6 or 7 because SQL Server CLR only supports .Net Framework, so this method is excluded from the current tests :(
  • LinqExcept
    This is the logic explained in my previous answer using arrays and LINQ Except().

  • LinqHashsetExcept
    Same as LinqExcept but uses a HashSet<string> instead of an array; the purpose of this is to remove duplicates. It isn't relevant anymore, as the OP has been updated to state that the items in the arrays will already be unique, but it is an interesting case to leave in to see what overheads might be involved.

  • RegexMatchIterator
    Use Regex.IsMatch() instead of String.Contains; other than that, the same setup as the ConcatenatedString approach.

  • RegexMatchAllPattern
    RegEx matching by changing the whole required string into a single regular expression; no array iterations in this solution (see the worked example below).
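
To make that last one concrete, here is the pattern the replacement logic shown above builds for the original str2 value (derived directly from the RegexMatchAllPattern code, so no assumptions here):

var pattern = "(?=.*?\\b" + "772+240+802+P55+263B+1U2+VB".Replace("+", "\\b)(?=.*?\\b") + "\\b)";
// pattern now contains:
// (?=.*?\b772\b)(?=.*?\b240\b)(?=.*?\b802\b)(?=.*?\bP55\b)(?=.*?\b263B\b)(?=.*?\b1U2\b)(?=.*?\bVB\b)
// One lookahead per required term, so the pattern only matches when every
// term appears somewhere in the searched string as a whole word.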

You might find additional resources on RegEx helpful; many discussions talk through why and when you might choose one approach over another.

Before we get started on some performance metrics, you will notice I have not included any tests relying on sorting the data. My expectation of this test set is that we will prove that splitting the strings into arrays and then iterating through the array to find the result takes longer than a simple String.Contains or String.IndexOf. There is an interesting discussion on the associated time complexities here.

NOTE: The following tests are executed on https://dotnetfiddle.net/ using the .Net 4.7.2 compiler. The actual numbers will be different in your operating environment, but it is probably a good comparison to how CLR UDFs are executed from within SQL Server.

  • It is a good comparison because the IO load is variable and shared across other instances; at any point in time in a live SQL database there will be a varying number of other concurrent queries that affect performance.
  • Ignore the specific values, but look at the patterns they form.

The first benchmark test is to run each of these methods once for the original values of str1 and str2, capturing the time it takes to execute a cold start; then, after executing 10 times, we sample the next execution:

Method                 Cold Start (ms)  Warm (ms)
Arrays                 0.6484           0.0088
ConcatenatedString     0.0069           0.0024
FormattedString        0.1939           0.0029
LinqExcept             0.0104           0.0036
LinqHashsetExcept      0.2009           0.0052
RegexMatchIterator     0.0389           0.0216
RegexMatchAllPattern   0.0929           0.0879
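
For reference, timings like these can be captured with a simple Stopwatch harness along the following lines (an illustrative sketch, not the exact fiddle code):

// Requires using System.Diagnostics; method names as defined above.
var sw = Stopwatch.StartNew();
ConcatenatedString(str1, str2);           // cold start
sw.Stop();
double coldMs = sw.Elapsed.TotalMilliseconds;

for (int i = 0; i < 10; i++)
    ConcatenatedString(str1, str2);       // warm the method up

sw.Restart();
ConcatenatedString(str1, str2);           // sample the warm execution
sw.Stop();
double warmMs = sw.Elapsed.TotalMilliseconds;

Console.WriteLine("Cold: {0:F4} ms, Warm: {1:F4} ms", coldMs, warmMs);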

So far ConcatenatedString wins, but this isn't conclusive yet: I'm only running on dotnetfiddle, and each time I execute the test set the numbers come back different, though with a very similar distribution. We need to validate the test logic and identify baseline execution patterns before we start to vary the inputs. We do that by running the same evaluation for different numbers of iterations.

The following data is captured from this fiddle: https://dotnetfiddle.net/QRPRSu

Mean Execution Time (ms) by iteration count:

Method                 1       10       100       1000       10000
Arrays                 0.0088  0.0036   0.003464  0.0034006  0.00300586
ConcatenatedString     0.0024  0.00256  0.002541  0.0024637  0.00230981
FormattedString        0.0029  0.00309  0.00308   0.0030005  0.00288773
LinqExcept             0.0036  0.00442  0.004649  0.0050969  0.00436625
LinqHashsetExcept      0.0052  0.00605  0.006088  0.0061603  0.00599868
RegexMatchIterator     0.0216  0.02257  0.058507  0.022021   0.02383395
RegexMatchAllPattern   0.0879  0.08973  0.087446  0.0904722  0.08860092

I didn't expect this, but for the 100-iteration cycle there is a spike; presumably some event occurs between 10 and 100 iterations that causes the execution time to spike, and testing in epochs less than or greater than 100 seems to smooth it out. Across all the iterations the values are still relative to each other and very similar, enough that we don't need to try a massive number of iterations with the same parameters to obtain consistent results.

Baseline Benchmarking results

On to the actual testing, this time using the new comparison strings:

    string str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500508,523,543L,580,584,772,802,P55,VB" ;
    string str2 = "772+240+802+P55+263B+1U2+VB";
    string str3 = "772+240+802+P55+263B+508+VB"; // false
    string str4 = "200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379";//false
    string str5 = "612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV";//false
    string str6 = "255B+2U1+775U+336B+P55+802";//true
    string str7 = "220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830";//false
    string str8 = "255B";//true
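
As a quick sanity check (a sketch, not the exact fiddle code), each method can be run over these cases and compared against the expected flags in the comments:

// Pair each required-values string with its expected result from the comments above.
var cases = new[] {
    Tuple.Create(str2, true),  Tuple.Create(str3, false), Tuple.Create(str4, false),
    Tuple.Create(str5, false), Tuple.Create(str6, true),  Tuple.Create(str7, false),
    Tuple.Create(str8, true)
};
var methods = new Func<string, string, bool>[] {
    Arrays, ConcatenatedString, FormattedString, LinqExcept,
    LinqHashsetExcept, RegexMatchIterator, RegexMatchAllPattern
};
foreach (var method in methods)
    foreach (var testCase in cases)
        if (method(str1, testCase.Item1) != testCase.Item2)
            Console.WriteLine("{0} failed for \"{1}\"", method.Method.Name, testCase.Item1);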

Run this fiddle: https://dotnetfiddle.net/sFqleR

Method                 TestCases  Iterations  TotalTests  Overall           Milliseconds  ColdStart  Mean       StdDev     Variance
Arrays                 7          1000        7000        00:00:00.0255972  25.5972       0.8965     0.0035615  0.0495347  0.0495382
ConcatenatedString     7          1000        7000        00:00:00.0152603  15.2603       0.1847     0.0020862  0.0093536  0.0093543
FormattedString        7          1000        7000        00:00:00.0161606  16.1606       0.2099     0.0022097  0.0040866  0.0040869
LinqExcept             7          1000        7000        00:00:00.0308686  30.8686       3.8193     0.0043055  0.0087906  0.0087912
LinqHashsetExcept      7          1000        7000        00:00:00.0433723  43.3723       2.821      0.0060956  0.0089210  0.0089217
RegexMatchIterator     7          1000        7000        00:00:00.0816964  81.6964       0.5562     0.0115591  0.0318308  0.0318331
RegexMatchAllPattern   7          1000        7000        00:00:07.0529764  7052.9764     1.0156     1.0073593  1.0744530  1.0745297

ConcatenatedString still wins, but not by much over a String.Format approach.

  • It is interesting to note that the variance for String.Format is the lowest; that indicates it is the most consistent over the test cases.

Now let's try a full outer join between the optional and required values for a bit of variation. Then, for fun, let's reverse the strings and randomise the values to account for any bias in the sequence of the strings.

Fiddle: https://dotnetfiddle.net/m6YQv9

Method                 TestCases  Iterations  TotalTests  Overall           Milliseconds  ColdStart  Mean       StdDev     Variance
Arrays                 1024       10          10240       00:00:00.0188368  18.8368       0.2294     0.0017365  0.0128301  0.0128307
ConcatenatedString     1024       10          10240       00:00:00.0157159  15.7159       0.2115     0.0014364  0.0071938  0.0071941
FormattedString        1024       10          10240       00:00:00.0164970  16.497        0.2815     0.0015186  0.0038495  0.0038497
LinqExcept             1024       10          10240       00:00:00.0239897  23.9897       0.2688     0.0022340  0.0077261  0.0077265
LinqHashsetExcept      1024       10          10240       00:00:00.0292562  29.2562       0.2896     0.0027478  0.0059878  0.0059881
RegexMatchIterator     1024       10          10240       00:00:00.0800746  80.0746       0.4529     0.0077212  0.0508216  0.0508241
RegexMatchAllPattern   1024       10          10240       00:00:03.6379600  3637.96       0.5773     0.3551101  0.3789965  0.3790150

Comparison of the different testing methodologies

Overall ConcatenatedString still wins, but this time the margin is even smaller, and the FormattedString solution is still the most consistent, which might indicate that for a very large set of data, it could be the most reliable.

Also have a look at the original array-based solution: it is only marginally more expensive than FormattedString but has a comparatively high degree of variance, which means this result set might have been lucky.

  • The last test only had 10 iterations of each combination due to runtime execution constraints, so it's not the most reliable data.

I had dreams of generating a random set of terms that fit the patterns described and then injecting a random number of them, but TBH I'm out of time. It was a fun adventure, but it's chewed up my Sunday and now it's getting late for dinner.

When looking for performance tuning, you need to have some good metrics around what you are currently experiencing and a realistic idea of both what is acceptable and how much time you are willing to pay for this effort. As an example, I bet the OP has already spent a number of hours on this task; add my 8 hours today and the collective hours of others who have read this post and considered it, and this solution has cost the team well over 12 hours.

I always defer back to this xkcd comic: Is It Worth the Time?

Is It Worth the Time?

Don't forget the time you spend finding the chart to look up what you save. And the time spent reading this reminder about the time spent. And the time trying to figure out if either of those actually make sense. Remember, every second counts toward your life total, including these right now.

The difference between the Array method and the ConcatenatedString is still very marginal; you would need a very large dataset to really observe any difference that this implementation can provide.

We've spent over 12 hours, so if this makes your query run 5 seconds faster, and you run it over 5 times a day, then in 5 years you break even: the time spent working on this issue versus the cumulative time users would have spent waiting for the query to complete if you had done nothing.

When working with SQL and CLR functions, there are other optimisations you can make in the SQL database that can significantly improve performance; CLR functions are hard for SQL to optimise. For a query that involves this sort of data, it is easy to assume that improvements to the rest of the structure of the query, or evaluating other filtering constraints before this type of lookup, would save more time than what we have saved here today.

A Final Thought...
If the str1 value is pretty consistent for all the items in the list, then an XML or JSON comparison inside the SQL can be far more efficient. You can reduce the cycles for each row by half by pre-formatting the str1 value as a constant variable for the query, but you lose any benefit from this if you try to pass that data across the boundary into CLR.

If this is frequently queried data used in search functions, similar to tags in tweets and other social media posts, then there are better architectural solutions to this problem, the simplest of which is to create a table of the individual terms so we can index them. The good news about SQL is that this implementation detail can be entirely managed in the database with triggers; that way the application and business domain logic do not need to be affected by this structural optimisation.

As you are using CLR, you should read over CLR vs. T-SQL: Performance considerations for some useful background. It's an older resource and SQL has evolved, but it's still relevant thinking.


Update: .Net 6

So after dinner, it's time to check out .Net 6. I had started to write up a great summary, because a couple of test runs showed that the original Arrays implementation was fastest for fixed values over 100K iterations, which would have meant all of this was potentially for nothing, but I can't replicate that result now :) Ultimately that means this comparison is so fast with any of these options that you are better off looking elsewhere for optimisations if you need them.

Fiddle: https://dotnetfiddle.net/VoMeo9

Fixed input test in .Net 6

Test leader board 1

If we graph those numbers, excluding the RegEx and HashSet results, you can see that over time the average execution time converges to a much smaller range, and the LinqExcept implementation even beats the FormattedString implementation. This is interesting when you compare it to the later results, because it probably points to some special optimisations within .Net when we run the same calculation repeatedly.

String Interpolation is consistently close to the Concatenated String approach and even beats it twice in this series.

Chart of results in .Net 6

The final test is to revisit the random results feed to see if Interpolation can get the win that I was hoping for:

Fiddle: https://dotnetfiddle.net/4N23s5

Method                 TestCases  Iterations  TotalTests  Overall           Milliseconds  ColdStart  Mean       StdDev     Variance
Arrays                 1024       80          81920       00:00:00.0856801  85.6801       0.2904     0.0009908  0.0021470  0.0021470
ConcatenatedString     1024       80          81920       00:00:00.0637065  63.7065       0.1513     0.0007239  0.0014517  0.0014517
FormattedString        1024       80          81920       00:00:00.0732739  73.2739       0.3548     0.0008383  0.0016765  0.0016765
InterpolatedString     1024       80          81920       00:00:00.0614540  61.454        0.2699     0.0006978  0.0017731  0.0017731
LinqExcept             1024       80          81920       00:00:00.1139557  113.9557      0.8458     0.0013408  0.0072938  0.0072939
LinqHashsetExcept      1024       80          81920       00:00:00.1413452  141.3452      0.349      0.0016796  0.0019573  0.0019573
RegexMatchIterator     1024       80          81920       00:00:00.4237531  423.7531      4.8922     0.0051170  0.0187118  0.0187119
RegexMatchAllPattern   1024       80          81920       00:00:08.2265704  8226.5704     1.1987     0.1003511  0.1045117  0.1045123

It's the slimmest of margins... We'd call that a Bee's D!#@ where I come from... Except that this time we are measuring, and these numbers are significantly smaller than that (it's ~4,800µm if you were wondering).

I had to try a few times to get the results I was looking for; with the test cases being randomised, it makes sense that we'd get different output on different executions. Here are the results from two observations:

Results of randomised testing in .Net 6

Everyone loves a chart, so here's the final one:

Mean Execution time for Random tests in .Net 6

Upvotes: 2

Chris Schaller

Reputation: 16574

It depends on the length of the strings, but you could get a minor optimisation by using a HashSet to ensure that there are only unique entries in the first array:

HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
foreach (var s in str2.Split('+'))
{
    if (!str1Values.Contains(s))
    {
         return false;
    } 
}
return true;

Or we can do the same in LINQ:

HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
return str2.Split('+').All(s => str1Values.Contains(s));

We can also use LINQ Except() to do a similar comparison: with Except() we want no results, which would indicate that all the values matched.

return !str2.Split('+')
            .Except(new HashSet<string>(str1.Split(',')))
            .Any();

You could also package this into an extension method to reduce some of the ambiguity:

public static class StringExtensions
{
    /// <summary>Check if all the elements in this string exist in the comparison</summary>
    /// <param name="listOfAllValues">the CSV string that contains all the possible values</param>
    /// <param name="requriedElements">CSV of the values that all MUST exist in the source list</param>
    /// <param name="requiredElementsDelimiter">[Optional] delimiter to use for the requriedElements string</param>
    public static bool ContainsAll(this string listOfAllValues, string requiredElements, char requiredElementsDelimiter = '+')
    {
        return !requiredElements.Split(requiredElementsDelimiter)
            .Except(new HashSet<string>(listOfAllValues.Split(',')))
            .Any();
    }
}

Execute it like this:

Console.WriteLine(str1.ContainsAll(str2));
Console.WriteLine(str1.ContainsAll(str2, '+'));

See this fiddle: https://dotnetfiddle.net/45tU6z. It is semantically the same as your code and will break on the first non-matching value.


For small lists you could skip the array for the first string and just use Contains; it might not make a measurable difference though:

string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains("," + s + ","))
    {
         return false;
    } 
}
return true;

Or in LINQ:

var str1Fixed = "," + str1 + ",";
return str2.Split('+').Select(s => "," + s + ",").All(s => str1Fixed.Contains(s));

NOTE: I had to add leading and trailing commas to ensure we get exact matches.

I wouldn't resort to regular expressions for simple exact matching like this. I suspect we would only need to split str1 into an array if there were going to be either a large number of comparisons (many elements in str2) or a very large number of elements in str1, at which point we could sort the first string and do some further optimisations; but you don't need to go that far until you can demonstrate performance issues.

The other consideration is object reuse. If this is in a comparison that is called many times inside another iterative block or loop, then there might be some additional optimisations, especially if one of the comparisons is relatively static.
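
For example, if the optional-values list is effectively static across many calls, the parsed set could be built once and reused. This is a sketch under that assumption; the class and member names are illustrative:

public static class ContainsAllCache
{
    // Cache one parsed HashSet per distinct optional-values string, assuming the
    // same list is compared against many different required-value strings.
    // Note: not thread-safe; use a ConcurrentDictionary if called concurrently.
    private static readonly Dictionary<string, HashSet<string>> Sets =
        new Dictionary<string, HashSet<string>>();

    public static bool ContainsAllCached(string optionalValuesCSV, string requiredValuesPSV)
    {
        HashSet<string> set;
        if (!Sets.TryGetValue(optionalValuesCSV, out set))
        {
            set = new HashSet<string>(optionalValuesCSV.Split(','));
            Sets[optionalValuesCSV] = set;
        }
        return requiredValuesPSV.Split('+').All(set.Contains);
    }
}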

For significantly large lists you could index the larger array by the first letter and sort lexicographically... But you really need to have a good justification for taking it this far.
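
As a sketch of that idea: since the OP states the first list is already sorted and unique, Array.BinarySearch could replace the linear scan (this assumes the list is sorted with a consistent ordinal comparison; if it isn't, it would need to be sorted that way first):

public bool BinarySearchSorted(string optionalValuesCSV, string requiredValuesPSV)
{
    // Assumes optionalValuesCSV is sorted ordinally and contains unique entries.
    string[] sorted = optionalValuesCSV.Split(',');
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (Array.BinarySearch(sorted, s, StringComparer.Ordinal) < 0)
            return false;
    }
    return true;
}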

Ideally, by using LINQ comparisons like Except, we are trusting that someone smarter than us has applied practical optimisations, so that we can concentrate on expressing our logic and worry less about performance considerations.

Upvotes: 1
