Iterating over multi-thousand element List

Question

case 15: {
    for (int i = 0; i < words.Count; i++) {
        if (words[i].Length == 8) {
            var tupled = words[i].ConcatCheck();
            for (int n = 0; n < word.Count; n++)
                if (word[n] == tupled.Item1 || word[n] == tupled.Item2)
                    temp++;
        }
        if (temp >= 2)
            matches.Add(words[i]);
        temp = 0;
    }
    break;
}

What it does:
The first 'for loop' iterates through a List of words about 248000 elements long, checking for words of length 8. When one is found, I create a Tuple of the first and last half of the word (4 letters per half) by calling the ConcatCheck() method (an Extension method I wrote for obj String). That part is fast and fine.

What really needs work is the second 'for loop'. Every single 8 letter word activates this loop, which iterates over an even larger List of about 267000 elements, checking to see if both Items of the Tuple exist. If both are found, I have the original word added to a list "matches".

This part takes almost 3 minutes to find all matches in the 248k dictionary I have. Any way to optimize/speed it up?

Manfred Radlwimmer · Accepted Answer

If you simply want to check if a word exists in a collection, use a HashSet instead of a List or Array. The HashSet class is optimized for Contains checks.

Example

With the follwing code I found all 8 letter words made up of two 4 letter words in the english dictionary (github version) in under 50 ms.

WebClient client = new WebClient();
string dictionary = client.DownloadString(
    @"https://raw.githubusercontent.com/dwyl/english-words/master/words.txt");

Stopwatch watch = new Stopwatch();
watch.Start();

HashSet fourLetterWords = new HashSet();

using (StringReader reader = new StringReader(dictionary))
{
    while (true)
    {
        string line = reader.ReadLine();
        if (line == null) break;
        if (line.Length != 4) continue;

        fourLetterWords.Add(line);
    }
}

List matches = new List();

using (StringReader reader = new StringReader(dictionary))
{
    while (true)
    {
        string line = reader.ReadLine();
        if (line == null) break;
        if (line.Length != 8) continue;

        if (fourLetterWords.Contains(line.Substring(0, 4)) &&
            fourLetterWords.Contains(line.Substring(4, 4)))
            matches.Add(line);
    }
}

watch.Stop();

Why is your code so slow?

for (int n = 0; n < word.Count; n++)
    if (word[n] == tupled.Item1 || word[n] == tupled.Item2)
        temp++;

This part is one of the culprits. Instead of checking Are both parts contained in my array? you are checking Are 2 or more of my 2 words contained in an array?.

You could optimize this part by breaking the loop once you found both words.

if (word[n] == tupled.Item1 || word[n] == tupled.Item2)
    if(++temp >= 2) break;

Further optimizations could be made (depending on how often you run this search) by pre-sorting your words by length or alphabetically.

Iterating over multi-thousand element List

Answers (2)

Related Questions