Reputation: 422
I'm looking for ideas on how to improve looping through a very large `List` in C#.
I have a `List<string>` named `excludes` that contains over 4 million strings.
I have a string named `message` that is generated from user input.
I need to check whether the user input (`message`) contains any of the strings in the list (`excludes`) and, if so, remove every matched string from `message`.
Example:
private static List<string> excludes = new List<string>(); // --> 4 million entries
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input
foreach (string exclude in excludes)
{
if (message.Contains(exclude))
{
message = message.Replace(exclude, "");
}
}
On average this process takes about 350 ms to complete because of the size of the list.
Is there any way I could improve my code and reduce the time required to complete this task?
Upvotes: 0
Views: 1418
Reputation: 74605
Disclaimer: I haven't looked at the word list; it didn't load on my cell phone. I imagine, from the comments, that it's a CSV-like list of 4 million "words"; the sort of "words" you might see if you had a million monkeys bashing on a million typewriters, probably because that's some corollary to how they were actually generated 😆
In the comments you seemed to indicate that the words in the input string are separated by spaces (otherwise they don't show up as emojis). As such, I'd load the 4 million exclusions into a `HashSet<string>`, split the string into words, remove the words that appear in the hashset, then recombine the result:
private static HashSet<string> excludes = new HashSet<string>(); // --> 4 million entries, load it once
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input
var bits = message.Split(' ');
for (int i = 0; i < bits.Length; i++)
{
    if (excludes.Contains(bits[i])) // O(1) hashset lookup instead of scanning a 4M list
    {
        bits[i] = null; // null entries become "" when joined back together
    }
}
var result = string.Join(" ", bits);
This just splits on space, so it knows it can recompose the result using space. If your input contains other separators (I can see you have a \r\n in there, which would defeat this), then you probably want to look at splitting while keeping the delimiters, so that you can get your split tokens, replace some, then do a `string.Concat` instead of a join. If you want to wring every last millisecond out of it, you probably need to look at shifting the solution to `Span`s, but this simple start should let you investigate whether it's an avenue worth pursuing.
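As a rough sketch of the "split but keep the delimiters" idea (the whitespace pattern here is an assumption; adjust it to whatever separators your input actually uses), `Regex.Split` with a capturing group returns the separators interleaved with the tokens:

using System.Text.RegularExpressions;

// Splitting on a captured whitespace run keeps each separator in the
// result array, so string.Concat restores the original spacing exactly.
var tokens = Regex.Split(message, @"(\s+)");
for (int i = 0; i < tokens.Length; i++)
{
    if (excludes.Contains(tokens[i]))
    {
        tokens[i] = ""; // drop excluded words, keep the delimiters around them
    }
}
var result = string.Concat(tokens);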
All in all, I think it's important to tokenize your input and then check entries in their entirety, rather than perform naive repeated replacements. If your word list contains "hel", "lo", "orl" and "wd", then "hello world" will be reduced to just the separating space, even though it contains none of those words; this is because e.g. replacing "orl" in "world" with nothing creates "wd", which never existed in the input. It's also important to note that if the replacements were performed in a different order you'd get a different result ("hello" would still disappear, but "world" would become "wd" if the "orl" replacement were done last).
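A quick sketch of that cascade, using the example words from the paragraph above:

// Naive repeated replacement: each Replace can create new "matches"
// that never existed in the original input.
string s = "hello world";
foreach (var w in new[] { "hel", "lo", "orl", "wd" })
{
    s = s.Replace(w, "");
}
Console.WriteLine($"'{s}'"); // prints "' '" - only the separating space survives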
Worth noting that a hashset is case sensitive by default. I don't think you said whether your LULwuts are case sensitive or not; if they're not, make the hashset case insensitive.
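For example, using the framework's built-in ordinal ignore-case comparer:

// All Add/Contains calls now ignore case.
private static HashSet<string> excludes =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase);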
Upvotes: 2
Reputation: 71544
Here are a few performance boosts, for which I'll give timings on my machine.

JustReplace: skip the Contains check, since Replace has to scan the string anyway:
foreach (var exclude in excludes)
{
    message = message.Replace(exclude, "");
}
ParallelContainReplace: run the Contains checks in parallel, taking a lock only when a replacement is actually needed:

var lockObject = new object();
Parallel.ForEach(excludes, exclude => {
    if (message.Contains(exclude))
    {
        lock (lockObject)
        {
            message = message.Replace(exclude, "");
        }
    }
});
ParallelJustReplace: a lock-free version that retries with Interlocked.CompareExchange if another thread swapped in a new string mid-replace:

Parallel.ForEach(excludes, exclude => {
    string message2;
    string message3;
    do
    {
        message2 = message;
        message3 = message2.Replace(exclude, "");
        // Replace returns the same instance when nothing matched,
        // so there's nothing to publish.
        if (object.ReferenceEquals(message2, message3))
            return;
        // Publish only if message is still the snapshot we read;
        // compare by reference, since string == compares values.
    } while (!object.ReferenceEquals(
        Interlocked.CompareExchange(ref message, message3, message2), message2));
});
| Method | Timings (sec) |
|---|---|
| ContainReplace (original) | 0.43 |
| JustReplace | 0.3 |
| ParallelContainReplace | 0.12 |
| ParallelJustReplace | 0.1 |
Upvotes: 1
Reputation: 1131
I'd look at the Bloom filter concept. It works in almost O(1) time and uses very little memory.
For example, there is a code sample in C# and a NuGet package.
But pay attention: it's a probabilistic data structure and does not always yield a correct result (it can report false positives, but never false negatives).
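A minimal sketch of the idea (this is an illustration, not the NuGet package mentioned above; the double-hashing scheme here is an assumption chosen for brevity):

using System;
using System.Collections;
using System.Collections.Generic;

class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int size, int hashCount)
    {
        bits = new BitArray(size);
        this.hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            bits[index] = true;
    }

    // May return a false positive, never a false negative.
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!bits[index])
                return false;
        return true;
    }

    private IEnumerable<int> Indexes(string item)
    {
        // Double hashing: derive hashCount indexes from two base hashes.
        uint h1 = (uint)item.GetHashCode();
        uint h2 = (uint)(item.GetHashCode() * 397) | 1; // illustrative second hash, forced odd
        for (int i = 0; i < hashCount; i++)
            yield return (int)((h1 + (uint)i * h2) % (uint)bits.Length);
    }
}

On a hit from MightContain you'd still confirm against the real exclusion set, so a false positive only costs one extra lookup, never a wrong removal.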
Upvotes: 1
Reputation: 959
You could load the strings into a Trie. Looking up a word to see if it is an excluded word would take at most N comparisons, where N is the length of the word.
Here are some resources:
https://www.geeksforgeeks.org/trie-insert-and-search/
https://en.wikipedia.org/wiki/Trie
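A minimal sketch of a trie with exact-word lookup (assuming the excluded words can be treated as plain character sequences):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // At most one child lookup per character of the word.
    public bool Contains(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsWord;
    }
}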
Upvotes: 0