Persson

Reputation: 422

Loop through very large List<string> with over 4 million entries in C#


I'm looking for ideas and/or improvements for looping through a very large `List` in C#.

I have a `List<string>` named `excludes` that contains over 4 million strings.
I have a string named `message` which is generated from user input.

I need to check whether the user input (`message`) contains any of the strings in the list (`excludes`) and, if so, remove the matched string(s) from the message.

Example:

private static List<string> excludes = new List<string>(); // --> 4 million entries
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input

foreach (string exclude in excludes)
{
    if (message.Contains(exclude))
    {
        message = message.Replace(exclude, "");
    }
}

On average this process takes about 350 ms to complete because of the size of the list.
Is there any way I could improve my code and reduce the time required to complete this task?

Upvotes: 0

Views: 1418

Answers (4)

Caius Jard

Reputation: 74605

Disclaimer: I haven't looked at the word list; it didn't load on my cellphone. I imagine, from the comments, it's like a CSV list of 4 million "words"; the sort of "words" you might see if you had a million monkeys bashing on a million typewriters, probably because that's some corollary to how they were actually generated 😆

In the comments you seemed to indicate that the words in the input string are separated by spaces, otherwise they don't show up as emojis. As such, I'd load the 4 million exclusions into a HashSet, split the string into words, remove the words that appear in the HashSet, then recombine the result:


private static HashSet<string> excludes = new HashSet<string>(); // --> 4 million entries, load it once


string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input

var bits = message.Split(' ');

for (int i = 0; i < bits.Length; i++)
{
    if (excludes.Contains(bits[i]))
    {
        bits[i] = null; // mark excluded words for removal
    }
}

// Filter out the nulled entries so the join doesn't leave double spaces
// (requires using System.Linq)
var result = string.Join(" ", bits.Where(b => b != null));

This just splits on space, so it knows it can recompose the result using space. If your input has other separators (I can see you have a \r\n there, which would defeat this) then you probably want to look at splitting while keeping the delimiters, so that you can take your split tokens, replace some, then do a string.Concat instead of a Join. If you want to wring every last millisecond out of it, you probably need to look at shifting the solution to Spans, but this simple start should let you investigate whether it's an avenue worth pursuing.
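As an aside, here is a minimal sketch of the keep-the-delimiters idea, assuming Regex.Split with a capturing group (the pattern and variable names are mine, for illustration; message and excludes are as above):

using System.Text.RegularExpressions;

// A capturing group in the pattern makes Regex.Split keep the whitespace
// runs in the output array, so the original separators survive the round trip.
string[] tokens = Regex.Split(message, @"(\s+)");

for (int i = 0; i < tokens.Length; i++)
{
    if (excludes.Contains(tokens[i]))
    {
        tokens[i] = ""; // drop the word, keep its surrounding whitespace
    }
}

string result = string.Concat(tokens);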

All in I think it's important to tokenize your input and then check entries in their entirety rather than just perform naive repeated replacements, because if your word list contains "hel", "lo", "orl" and "wd" then "hello world" will be reduced to just the separating space even though it contains none of those words. This is because eg replacing "orl" in "world" with nothing creates "wd" which never existed. Also important to note that if the replacements were performed in a different order you'd get a different result ("hello" would disappear but "world" would become "wd" if the "orl" replacement was done last)

Worth noting that HashSet is case sensitive by default. I don't think you said whether your LULwuts are case sensitive or not; if they're not, make the HashSet case insensitive.
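For example, passing a comparer at construction makes the set case insensitive (this is standard BCL API):

var excludes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);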

Upvotes: 2

Charlieface

Reputation: 71544

Here are a few performance boosts, for which I'll give timings on my machine.

  • Just try the replacement, without checking whether it's contained first.
    Replace is a no-op when nothing matches, so the Contains check just scans the string an extra time.
foreach (var exclude in excludes)
{
    message = message.Replace(exclude, "");
}
  • You could parallelize it.
    This will hopefully speed it up by a factor of the number of cores you have, although a high number of replacements will slow it down due to locking.
var lockObject = new object();
Parallel.ForEach(excludes, exclude => {
    if (message.Contains(exclude))
    {
        lock (lockObject) // serialize writes to the shared string
        {
            message = message.Replace(exclude, "");
        }
    }
});
  • You could parallelize it and just replace. A bit more complicated, as it requires a compare-and-swap retry loop in case another thread swaps the string mid-replacement.
Parallel.ForEach(excludes, exclude => {
    string message2;
    string message3;
    do
    {
        message2 = message;
        message3 = message2.Replace(exclude, "");
        if (object.ReferenceEquals(message2, message3))
            return; // nothing was replaced, so nothing to swap in
    } while (!object.ReferenceEquals(
        Interlocked.CompareExchange(ref message, message3, message2), message2));
});
Method                      Timings (sec)
ContainReplace (original)   0.43
JustReplace                 0.3
ParallelContainReplace      0.12
ParallelJustReplace         0.1
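For reference, a minimal sketch of how such timings can be taken, assuming a simple Stopwatch harness (the Time helper name is mine, not from the answer):

using System;
using System.Diagnostics;

// Hypothetical helper: run a candidate once and print elapsed seconds.
static void Time(string name, Action candidate)
{
    var sw = Stopwatch.StartNew();
    candidate();
    sw.Stop();
    Console.WriteLine($"{name}: {sw.Elapsed.TotalSeconds:F2} sec");
}

// e.g. Time("JustReplace", () => { foreach (var e in excludes) message = message.Replace(e, ""); });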

Upvotes: 1

Andriy Shevchenko

Reputation: 1131

I'd look at the Bloom filter concept. Lookups are almost O(1) and it uses very little memory.

For example, there is a code sample in C# and a NuGet package.

But pay attention: it's a probabilistic data structure and does not always yield a correct result. It can report false positives (though never false negatives), so a positive hit should be confirmed against the real set.
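A minimal sketch of the idea (my own illustrative implementation, not the linked sample or NuGet package):

using System;
using System.Collections;
using System.Collections.Generic;

// Illustrative Bloom filter: k bit positions per item, derived by double hashing.
class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int size, int hashCount)
    {
        _bits = new BitArray(size);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int i in Indexes(item))
            _bits[i] = true;
    }

    // false => definitely absent; true => probably present (false positives possible)
    public bool MightContain(string item)
    {
        foreach (int i in Indexes(item))
            if (!_bits[i]) return false;
        return true;
    }

    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = Fnv1a(item) | 1; // force odd so the stride never collapses to 0
        for (int k = 0; k < _hashCount; k++)
            yield return (int)((uint)(h1 + k * h2) % (uint)_bits.Length);
    }

    private static int Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }
}

Because of the false positives, the filter works best as a fast pre-filter in front of the real HashSet: only when MightContain returns true do you consult the exact set.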

Upvotes: 1

harleybl

Reputation: 959

You could load the strings into a trie. Looking up a word to see whether it is excluded takes at most N comparisons, where N is the length of the word, regardless of how many words are stored.
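A minimal sketch of such a trie (my own illustrative implementation; the resources below cover it in more depth):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // At most word.Length dictionary lookups, independent of how many
    // words are stored.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsWord;
    }
}

You would insert all 4 million excludes once, then call Contains(word) for each token of the message.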

Here are some resources:

https://www.geeksforgeeks.org/trie-insert-and-search/

https://en.wikipedia.org/wiki/Trie

Upvotes: 0
