Reputation: 614
Let's say we have two strings:
str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500,508,523,543L,580,584,772,802,P55,VB" ;
str2 = "772+240+802+P55+263B+1U2+VB";
and want to know if every substring of str2, split by "+", is in str1.
Is there something "better" than doing this
string[] str1Arr = str1.Split(',');
foreach (var s in str2.Split('+'))
{
    if (!str1Arr.Contains(s))
    {
        return false;
    }
}
return true;
str2 would return true, while
str3 ="772+240+802+P55+263B+1U2+V";
would return false.
Edit:
Max length of str1 is ~1000, with all delimited strings being unique and sorted. Max length of str2 is ~700 (average 200), with substrings being unique but not sorted.
Here are some more strings:
"200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379"//false
"612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV"//false
"255B+2U1+775U+336B+P55+802"//true
"220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830"//false
"255B"//true
Upvotes: 1
Views: 434
Reputation: 16574
Do not split str1 into an array; just use String.Contains() to find the match.
string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
if (!str1Fixed.Contains("," + s + ","))
return false;
}
return true;
For a small number of elements, string Contains is going to be more efficient, especially in the case of positive matches that are close to the start of the string.
This was my very first thought before I provided my previous answer, but it was a gut feeling not supported by any empirical evidence. I also deleted that original answer because it didn't take into account exact matches (I didn't inject the commas into the search args).
It turns out that using string interpolation can save a few memory allocations, so it will enhance this solution even further:
string str1Fixed = $",{str1},";
foreach (var s in str2.Split('+'))
{
if (!str1Fixed.Contains($",{s},"))
return false;
}
return true;
The difference is marginal, but if you are executing a significant number of iterations, it's all going to add up.
But don't take my word for it, let us prove it!
Firstly, I will introduce you to our test cases, all of which have been tested to ensure compliance:
public bool Arrays(string optionalValuesCSV, string requiredValuesPSV)
{
    string[] str1Arr = optionalValuesCSV.Split(',');
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Arr.Contains(s))
            return false;
    }
    return true;
}

public bool ConcatenatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = "," + optionalValuesCSV + ",";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains("," + s + ","))
            return false;
    }
    return true;
}

public bool FormattedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = String.Format(",{0},", optionalValuesCSV);
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains(String.Format(",{0},", s)))
            return false;
    }
    return true;
}

public bool InterpolatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = $",{optionalValuesCSV},";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains($",{s},"))
            return false;
    }
    return true;
}

public bool LinqExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(optionalValuesCSV.Split(','))
        .Any();
}

public bool LinqHashsetExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(new HashSet<string>(optionalValuesCSV.Split(',')))
        .Any();
}

public bool RegexMatchIterator(string optionalValuesCSV, string requiredValuesPSV)
{
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        var searchPattern = new Regex("," + s + ",");
        if (!searchPattern.IsMatch("," + optionalValuesCSV + ","))
            return false;
    }
    return true;
}

public bool RegexMatchAllPattern(string optionalValuesCSV, string requiredValuesPSV)
{
    // note: need to escape \b for word boundary
    var pattern = "(?=.*?\\b" + requiredValuesPSV.Replace("+", "\\b)(?=.*?\\b") + "\\b)";
    Regex searchPattern = new Regex(pattern);
    //return searchPattern.IsMatch("," + optionalValuesCSV + ",");
    return searchPattern.IsMatch(optionalValuesCSV);
}
Most of this is from the thought process and comments to my previous answer:
- Arrays: the OP logic to test against.
- ConcatenatedString: uses String.Contains() on the str1 value instead of splitting it into an array of strings.
- FormattedString: same logic as ConcatenatedString but uses String.Format instead of concatenating values.
- InterpolatedString: same logic as ConcatenatedString but uses string interpolation instead of concatenating values.
- LinqExcept: the logic explained in my previous answer using arrays and LINQ Except().
- LinqHashsetExcept: same as LinqExcept but uses a HashSet<string> instead of an array; the purpose of this is to remove duplicates. It isn't relevant any more as the OP has been updated to state the items in the arrays will already be unique, but it is an interesting case to leave in to see what overheads might be involved.
- RegexMatchIterator: uses Regex.IsMatch() instead of String.Contains; other than that, same setup as the ConcatenatedString approach.
- RegexMatchAllPattern: RegEx matching by turning the required string into a single regular expression; no array iterations in this solution.
You might find additional resources on RegEx helpful; many of the comments in them talk through why and when you might choose one approach over another.
Before we get started on some performance metrics, you will notice I have not included any tests relying on sorting the data. My expectation of this test set is that we will prove the effort it takes to split the strings into arrays and then iterate through the array to find the result will take longer than a simple String.Contains or String.IndexOf. There is an interesting discussion on the associated time complexities here.
NOTE: The following tests are executed on https://dotnetfiddle.net/ using the .NET 4.7.2 compiler. The actual numbers will differ in your operating environment, but it is probably a good comparison to how CLR UDFs are executed from within SQL Server.
- It is a good comparison because the IO load is variable and shared across other instances; at any point in time in a live SQL database there will be a varying number of other concurrent queries that affect performance.
- Ignore the specific values, but look at the patterns they form.
The first benchmark test is to run each of these methods once for the original values of str1 and str2, capturing the time it takes to execute a cold start; then, after executing 10 times, we sample the next execution:
Method | Cold Start (ms) | Warm (ms) |
---|---|---|
Arrays | 0.6484 | 0.0088 |
ConcatenatedString | 0.0069 | 0.0024 |
FormattedString | 0.1939 | 0.0029 |
LinqExcept | 0.0104 | 0.0036 |
LinqHashsetExcept | 0.2009 | 0.0052 |
RegexMatchIterator | 0.0389 | 0.0216 |
RegexMatchAllPattern | 0.0929 | 0.0879 |
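The timing harness itself isn't shown in this post; a minimal sketch of the methodology described above (the first call is the cold start; after ten warm-up calls, sample the next one) might look like this, with MeasureOnce being my own helper name:

// Sketch only: time a single invocation in milliseconds, using Stopwatch
// ticks for sub-millisecond resolution.
double MeasureOnce(Func<string, string, bool> method, string s1, string s2)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    method(s1, s2);
    sw.Stop();
    return sw.ElapsedTicks * 1000.0 / System.Diagnostics.Stopwatch.Frequency;
}

double coldStart = MeasureOnce(ConcatenatedString, str1, str2); // includes JIT cost
for (int i = 0; i < 10; i++)
    MeasureOnce(ConcatenatedString, str1, str2);                // warm the method up
double warm = MeasureOnce(ConcatenatedString, str1, str2);      // sampled warm run
Console.WriteLine($"Cold: {coldStart:F4} ms, Warm: {warm:F4} ms");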
So far ConcatenatedString wins, but this isn't conclusive yet: I'm only running on dotnetfiddle, and each time I execute the test set the numbers come back different, though with a very similar distribution. We need to validate the test logic and identify baseline execution patterns before we start to vary the inputs. We do that by running the same evaluation for different numbers of iterations.
The following data is captured from this fiddle: https://dotnetfiddle.net/QRPRSu
Mean Execution Time (ms) | 1 | 10 | 100 | 1000 | 10000 |
---|---|---|---|---|---|
Arrays | 0.0088 | 0.0036 | 0.003464 | 0.0034006 | 0.00300586 |
ConcatenatedString | 0.0024 | 0.00256 | 0.002541 | 0.0024637 | 0.00230981 |
FormattedString | 0.0029 | 0.00309 | 0.00308 | 0.0030005 | 0.00288773 |
LinqExcept | 0.0036 | 0.00442 | 0.004649 | 0.0050969 | 0.00436625 |
LinqHashsetExcept | 0.0052 | 0.00605 | 0.006088 | 0.0061603 | 0.00599868 |
RegexMatchIterator | 0.0216 | 0.02257 | 0.058507 | 0.022021 | 0.02383395 |
RegexMatchAllPattern | 0.0879 | 0.08973 | 0.087446 | 0.0904722 | 0.08860092 |
I didn't expect this, but for the 100-iteration cycle there is a spike; presumably some event occurs between 10 and 100 iterations that causes the execution time to jump, and testing with iteration counts below or above 100 seems to smooth it out. Across all the iteration counts the values remain relative to each other and very similar, enough that we don't need to try a massive number of iterations with the same parameters to obtain consistent results.
On to the actual testing, this time using the new comparison strings:
string str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500508,523,543L,580,584,772,802,P55,VB" ;
string str2 = "772+240+802+P55+263B+1U2+VB";
string str3 = "772+240+802+P55+263B+508+VB"; // false
string str4 = "200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379";//false
string str5 = "612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV";//false
string str6 = "255B+2U1+775U+336B+P55+802";//true
string str7 = "220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830";//false
string str8 = "255B";//true
Run this fiddle: https://dotnetfiddle.net/sFqleR
Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
---|---|---|---|---|---|---|---|---|---|
Arrays | 7 | 1000 | 7000 | 00:00:00.0255972 | 25.5972 | 0.8965 | 0.00356150000000027 | 0.0495346716008814 | 0.049538210170847 |
ConcatenatedString | 7 | 1000 | 7000 | 00:00:00.0152603 | 15.2603 | 0.1847 | 0.00208621428571429 | 0.00935358891914583 | 0.00935425710423217 |
FormattedString | 7 | 1000 | 7000 | 00:00:00.0161606 | 16.1606 | 0.2099 | 0.00220965714285706 | 0.0040865872450029 | 0.00408687917537046 |
LinqExcept | 7 | 1000 | 7000 | 00:00:00.0308686 | 30.8686 | 3.8193 | 0.00430554285714287 | 0.00879058428853723 | 0.00879121225469787 |
LinqHashsetExcept | 7 | 1000 | 7000 | 00:00:00.0433723 | 43.3723 | 2.821 | 0.00609560000000002 | 0.00892103745152035 | 0.00892167473676254 |
RegexMatchIterator | 7 | 1000 | 7000 | 00:00:00.0816964 | 81.6964 | 0.5562 | 0.0115590857142857 | 0.0318308417422667 | 0.0318331156174521 |
RegexMatchAllPattern | 7 | 1000 | 7000 | 00:00:07.0529764 | 7052.9764 | 1.0156 | 1.00735934285714 | 1.07445298146738 | 1.07452973633274 |
ConcatenatedString still wins, but not by much over the String.Format approach. The standard deviation for String.Format is the lowest of the group, which indicates it is the most consistent over the test cases.
Now let's try a full outer join between the optional and required values for a bit of variation. Then, for fun, let's reverse the strings and randomise the values to account for any bias in the sequence of the strings.
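The fiddle's generator isn't reproduced here, but the idea could be sketched like this (Shuffle is my own helper name; the real code may differ): cross-join optional-value variants with the required-value strings, shuffling the element order to counter any sequence bias.

var rng = new Random();
// Randomise the order of the delimited elements within a string.
string Shuffle(string source, char delim) =>
    string.Join(delim.ToString(), source.Split(delim).OrderBy(_ => rng.Next()));

// Pair every optional-values variant with every required-values string.
string[] requiredSets = { str2, str3, str4, str5, str6, str7, str8 };
var cases =
    from opt in new[] { str1, Shuffle(str1, ',') }
    from req in requiredSets
    select (Optional: opt, Required: Shuffle(req, '+'));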
Fiddle: https://dotnetfiddle.net/m6YQv9
Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
---|---|---|---|---|---|---|---|---|---|
Arrays | 1024 | 10 | 10240 | 00:00:00.0188368 | 18.8368 | 0.2294 | 0.00173652343749998 | 0.0128300901366091 | 0.0128307166517418 |
ConcatenatedString | 1024 | 10 | 10240 | 00:00:00.0157159 | 15.7159 | 0.2115 | 0.00143642578125 | 0.00719377438896546 | 0.00719412567320958 |
FormattedString | 1024 | 10 | 10240 | 00:00:00.0164970 | 16.497 | 0.2815 | 0.00151858398437492 | 0.0038494732983341 | 0.00384966127466541 |
LinqExcept | 1024 | 10 | 10240 | 00:00:00.0239897 | 23.9897 | 0.2688 | 0.00223395507812505 | 0.00772608535444904 | 0.00772646263234338 |
LinqHashsetExcept | 1024 | 10 | 10240 | 00:00:00.0292562 | 29.2562 | 0.2896 | 0.00274775390625008 | 0.00598778336636879 | 0.0059880757600192 |
RegexMatchIterator | 1024 | 10 | 10240 | 00:00:00.0800746 | 80.0746 | 0.4529 | 0.00772122070312506 | 0.0508215684930702 | 0.0508240501967361 |
RegexMatchAllPattern | 1024 | 10 | 10240 | 00:00:03.6379600 | 3637.96 | 0.5773 | 0.355110078125 | 0.378996464474246 | 0.379014971516495 |
Overall, ConcatenatedString still wins, but this time the margin is even smaller, and the FormattedString solution is still the most consistent, which might indicate that for a very large set of data it could be the most reliable.
Also have a look at the original Array-based solution: it is only marginally more expensive than FormattedString but has a comparatively high degree of variance, which means this result set might have been lucky.
I had dreams of generating a random set of terms that fit the patterns described and then injecting a random number of those terms, but TBH I'm out of time. It was a fun adventure, but it's chewed up my Sunday and now it's getting late for dinner.
When looking at performance tuning, you need to have some good metrics around what you are currently experiencing and a realistic idea of both what is acceptable and how much time you are willing to pay for this effort. As an example, I bet the OP has already spent a number of hours on this task; add my 8 hours today and the collective hours of others who have read this post and considered it, and this solution has cost the team well over 12 hours.
I always defer back to this xkcd comic: Is It Worth the Time?
Don't forget the time you spend finding the chart to look up what you save. And the time spent reading this reminder about the time spent. And the time trying to figure out if either of those actually make sense. Remember, every second counts toward your life total, including these right now.
The difference between the Array method and the ConcatenatedString is still very marginal; you would need a very large dataset to really observe any difference this implementation can provide.
We've spent over 12 hours, so if this makes your query run 5 seconds faster, and you run it more than 5 times a day, then in 5 years you break even: the time spent working on this issue equals the cumulative time users would have spent waiting for the query to complete if you had done nothing.
When working with SQL and CLR functions, there are other optimisations you can make in the SQL database that can significantly improve performance; CLR functions are hard for SQL to optimise. For a query that involves this sort of data, it is reasonable to assume that improvements to the rest of the structure of the query, or evaluating other filtering constraints before this type of lookup, would save more time than what we have saved here today.
A Final Thought...
If the str1 value is pretty consistent for all the items in the list, then an XML or JSON comparison inside SQL can be far more efficient: you can halve the cycles for each row by pre-formatting the str1 value as a constant variable for the query, but you lose any benefit from this if you try to pass that data across the boundary into CLR.
If this is frequently queried data used in search functions, similar to tags in tweets and other social media posts, then there are better architectural solutions to this problem, the simplest of which is to create a table of the individual terms so we can index them. The good news about SQL is that this implementation detail can be entirely managed in the database with triggers, so the application and business domain logic do not need to be affected by this structural optimisation.
As you are using CLR, you should read over CLR vs. T-SQL: Performance considerations for some useful background. It's an older resource and SQL has evolved, but it's still relevant thinking.
So, after dinner, it's time to check out .NET 6. I had started to write up a great summary, because a couple of test runs showed that the original Arrays implementation was fastest for fixed values over 100K iterations, which would have meant all of this was potentially for nothing, but I can't replicate that result now :) Ultimately, this comparison is so fast with either of these options that you are better off looking elsewhere for optimisations if you need them.
Fiddle: https://dotnetfiddle.net/VoMeo9
If we graph those numbers, excluding the RegEx results and the HashSet, you can see that over time the average execution time converges to a much smaller range, and the LinqExcept implementation even beats the formatted-string implementation. This is interesting when you compare it to later results, because it probably points to some special optimisations within .NET when we run the same calculation repeatedly.
String Interpolation is consistently close to the Concatenated String approach and even beats it twice in this series.
The final test is to re-visit the random results feed to see if Interpolation can get the win that I was hoping for:
Fiddle: https://dotnetfiddle.net/4N23s5
Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
---|---|---|---|---|---|---|---|---|---|
Arrays | 1024 | 80 | 81920 | 00:00:00.0856801 | 85.6801 | 0.2904 | 0.000990795898437515 | 0.00214697922130186 | 0.002146992325543057 |
ConcatenatedString | 1024 | 80 | 81920 | 00:00:00.0637065 | 63.7065 | 0.1513 | 0.0007239416503907266 | 0.0014517292239806258 | 0.00145173808471375 |
FormattedString | 1024 | 80 | 81920 | 00:00:00.0732739 | 73.2739 | 0.3548 | 0.0008382824707032207 | 0.0016765147809772755 | 0.0016765250137051203 |
InterpolatedString | 1024 | 80 | 81920 | 00:00:00.0614540 | 61.454 | 0.2699 | 0.0006977844238282229 | 0.0017730743291170337 | 0.0017730851512029848 |
LinqExcept | 1024 | 80 | 81920 | 00:00:00.1139557 | 113.9557 | 0.8458 | 0.0013408093261717685 | 0.007293846685190374 | 0.007293891203705162 |
LinqHashsetExcept | 1024 | 80 | 81920 | 00:00:00.1413452 | 141.3452 | 0.349 | 0.0016796301269533514 | 0.001957306707138949 | 0.0019573186537003933 |
RegexMatchIterator | 1024 | 80 | 81920 | 00:00:00.4237531 | 423.7531 | 4.8922 | 0.005117004394530726 | 0.018711766419816867 | 0.018711880628421194 |
RegexMatchAllPattern | 1024 | 80 | 81920 | 00:00:08.2265704 | 8226.5704 | 1.1987 | 0.10035106689453285 | 0.10451169495402168 | 0.10451233284862492 |
It's the slimmest of margins... We'd call that a Bee's D!#@ where I'm from... except that this time we are measuring, and these numbers are significantly smaller than that (it's ~4800µm, if you were wondering).
I had to try a few times to get the results I was looking for; with the test cases being randomised, it makes sense that we'd get a different output on different executions. Here are the results from two observations:
Everyone loves a chart, so here's the final one:
Upvotes: 2
Reputation: 16574
It depends on the length of the strings, but you could get a minor optimisation by using a HashSet to ensure that there are only unique entries in the first array:
HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
foreach (var s in str2.Split('+'))
{
    if (!str1Values.Contains(s))
    {
        return false;
    }
}
return true;
Or we can do the same in LINQ:
HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
return str2.Split('+').All(s => str1Values.Contains(s));
We can also use LINQ Except() to do a similar comparison; with Except() we want no results, as that would indicate all the values matched.
return !str2.Split('+')
    .Except(new HashSet<string>(str1.Split(',')))
    .Any();
You could also package this into an extension method to reduce some of the ambiguity:
public static class StringExtensions
{
    /// <summary>Check if all the elements in this string exist in the comparison</summary>
    /// <param name="listOfAllValues">the CSV string that contains all the possible values</param>
    /// <param name="requiredElements">CSV of the values that all MUST exist in the source list</param>
    /// <param name="requiredElementsDelimiter">[Optional] delimiter to use for the requiredElements string</param>
    public static bool ContainsAll(this string listOfAllValues, string requiredElements, char requiredElementsDelimiter = '+')
    {
        return !requiredElements.Split(requiredElementsDelimiter)
            .Except(new HashSet<string>(listOfAllValues.Split(',')))
            .Any();
    }
}
Execute it like this:
Console.WriteLine(str1.ContainsAll(str2));
Console.WriteLine(str1.ContainsAll(str2, '+'));
See this fiddle: https://dotnetfiddle.net/45tU6z. It is semantically the same as your code and will break on the first non-matching value.
For small lists you could skip the array for the first string, and just use contains, it might not make a measurable difference though:
string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
if (!str1Fixed.Contains("," + s + ","))
{
return false;
}
}
return true;
Or in LINQ:
var str1Fixed = "," + str1 + ",";
return str2.Split('+').Select(s => "," + s + ",").All(s => str1Fixed.Contains(s));
NOTE: I had to add leading and trailing commas to ensure we get exact matches
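Without those injected delimiters, a bare Contains happily matches partial tokens; a quick illustration using values from the question's str1:

string str1Fixed = ",021A,500,508,VB,";          // commas injected at both ends
Console.WriteLine(str1Fixed.Contains("V"));      // True  - "V" matches inside "VB"
Console.WriteLine(str1Fixed.Contains(",V,"));    // False - "V" is not an exact element
Console.WriteLine(str1Fixed.Contains(",508,"));  // True  - exact element match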
I wouldn't resort to regular expressions for simple exact matching like this, and I suspect we would only need to split str1 into an array if there were going to be either a large number of comparisons (many elements in str2) or a very large number of elements in str1, at which point we could sort the first string and do some further optimisations. But you don't need to go that far until you can demonstrate performance issues.
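For example, since the question states that the elements of str1 arrive sorted, one of those further optimisations could be a binary search over the split array; a minimal sketch, assuming the data's sort order matches ordinal string comparison:

string[] sortedValues = str1.Split(',');  // the question says these are already sorted
foreach (var s in str2.Split('+'))
{
    // Array.BinarySearch returns a negative number when the value is not found.
    if (Array.BinarySearch(sortedValues, s, StringComparer.Ordinal) < 0)
        return false;
}
return true;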
The other consideration is object reuse. If this comparison is called many times inside another iterative block or loop, then there might be some additional optimisations available, especially if one side of the comparison is relatively static.
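For instance, if str1 is the static side, the split and HashSet construction can be hoisted out of the loop and reused for every comparison; a sketch:

// Build the lookup once for the static side...
var optionalValues = new HashSet<string>(str1.Split(','));

// ...then each comparison inside the outer loop only pays for splitting str2.
bool ContainsAll(string requiredPSV) =>
    requiredPSV.Split('+').All(optionalValues.Contains);

Console.WriteLine(ContainsAll(str2));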
For significantly large lists you could index the larger array by the first letter and sort lexicographically... But you really need to have a good justification for taking it this far.
Ideally, by using LINQ comparisons like Except, we are assuming that someone smarter than us has applied practical optimisations for us, so that we can concentrate on expressing our logic and worry less about the performance considerations.
Upvotes: 1