Reputation: 7067
I have a very large dictionary, and its contents look like this (headers not included in the dictionary):
(code)      (names)
------------------------------
910235487   Diabetes, tumors, sugar sick, .....
I have more than 150K such pairs in the dictionary.
The user input is keywords (diagnosis names), so I cannot search the dictionary by its keys.
Here is the code:
var relevantIDs = this.dic.Where(ele => ele.Value.Contains(keyword))
.Select(n => Convert.ToUInt64(n.Key));
The dictionary is a Dictionary<string, string>,
and I have to use string as the key type because the codes can sometimes contain characters. The names column contains a list of relevant diagnosis names, so I cannot change this data type either.
I think the problem is that I run the Contains operation on the value of every pair, which slows down the whole process, but I cannot find an alternative way to do it...
This is what I did in order to find the matched codes.
But the performance of this code is terrible (it takes around 5 minutes to finish this single line of code).
Can someone help?
Update and simplest solution
I finally found the reason why the search was so slow, and solved it like this:
var relevantStringIDs = this.dic.Where(ele => ele.Value.Contains(keyword)).ToList();
var relevantUlongIDs = relevantStringIDs.Select(n => Convert.ToUInt64(n.Key)).ToList();
The reason it was that slow is that this.dic.Where(ele => ele.Value.Contains(keyword))
is executed again every time the second part of the query is enumerated (this is a feature of IEnumerable<T>
called deferred execution). So I use ToList()
to convert the deferred query into a concrete list in memory, so that the result can be reused when converting the strings to ulongs
rather than executing the query again for each conversion.
Please correct me if you find something wrong in this explanation.
By the way, although this may not be the best solution, the performance of the changed code is quite satisfactory. The first statement now costs only 169 ms, which is quick enough for me.
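To see the deferred execution at work, here is a minimal sketch (the sample data and the evaluations counter are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var dic = new Dictionary<string, string>
        {
            { "910235487", "Diabetes, tumors, sugar sick" },
            { "120391052", "Fever, diabetes" }
        };

        int evaluations = 0;
        // Building the query runs nothing yet; the predicate fires only on enumeration.
        var query = dic.Where(ele => { evaluations++; return ele.Value.Contains("sick"); });

        // Every enumeration of the deferred query re-scans the whole dictionary...
        var first = query.ToList();
        var second = query.ToList();
        Console.WriteLine(evaluations); // prints 4: two pairs scanned per enumeration

        // ...whereas the materialized list can be reused without re-running the predicate.
        var ids = first.Select(n => Convert.ToUInt64(n.Key)).ToList();
    }
}
```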
Upvotes: 3
Views: 4506
Reputation: 17755
Your problem is that you lose all of the speed benefits of the dictionary by iterating over its values. Dictionaries are optimized for key lookups.
I would approach this with a different data type, optimized for your keyword lookups.
Here's an example using LINQ to create a Lookup from data similar to yours. In this case, I'm building it directly from string data which avoids the dictionary entirely.
This type of lookup should perform much better.
string[] lines = {
    "123 A, B, C, D",
    "456 E, F, G",
    "321 A, E, H, I",
    "654 B, G",
    "789 A, J, K, L",
    "987 A, M, L, E"
};
var lookup = lines.SelectMany(
    // split off everything after the code, then break it into trimmed lowercase keywords
    l => l.Split(new char[] { ' ' }, 2)[1].Split(',').Select(v => v.Trim().ToLower()).ToArray(),
    (l, o) => new
    {
        keyword = o,
        code = Convert.ToInt64(l.Split(' ')[0])
    }).ToLookup(k => k.keyword, v => v.code);
Console.WriteLine(String.Join(",", lookup["a"]));
Console.WriteLine(String.Join(",", lookup["l"]));
Console.WriteLine(String.Join(",", lookup["b"]));
Note that this assumes you are looking up whole keywords (your initial example could look up partial keywords).
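If partial keywords must still be supported, one possible compromise (my sketch, not tested against your data) is to run Contains over the lookup's distinct keys, which are far fewer than 150K pairs:

```csharp
// Hypothetical partial-match search over the lookup built above:
// scans only the distinct keywords, not every value in the source data.
string partial = "diab"; // example user input (assumed)
var codes = lookup
    .Where(g => g.Key.Contains(partial)) // substring test per distinct keyword
    .SelectMany(g => g)                  // flatten matching groups into codes
    .Distinct()
    .ToArray();
```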
Upvotes: 2
Reputation: 54877
You're doing it the wrong way round. Dictionaries permit efficient lookups when you know the key, not the value.
A simple way of fixing performance would be to construct a reverse dictionary mimicking a full-text index over your content:
var dic = new Dictionary<string, string>();
dic.Add("910235487", "Diabetes, tumors, sugar sick");
dic.Add("120391052", "Fever, diabetes");
char[] delimiters = new char[] { ' ', ',' };
var wordCodes =
    from kvp in dic
    from word in kvp.Value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
    let code = long.Parse(kvp.Key)
    select new { Word = word, Code = code };
var fullTextIndex =
    wordCodes.ToLookup(wc => wc.Word, wc => wc.Code, StringComparer.OrdinalIgnoreCase);
long[] test1 = fullTextIndex["sugar"].ToArray(); // Gives 910235487
long[] test2 = fullTextIndex["diabetes"].ToArray(); // Gives 910235487, 120391052
The construction of the full-text index will take a long time; however, this is a one-off cost, and will be amortized through the time savings of subsequent lookups.
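One way to keep that one-off cost out of startup is to build the index on first use; here is a sketch using Lazy&lt;T&gt; (the wrapper class and its names are my own invention, not part of the solution above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DiagnosisSearch
{
    private readonly Lazy<ILookup<string, long>> index;

    public DiagnosisSearch(Dictionary<string, string> dic)
    {
        // The expensive index build runs once, on the first lookup,
        // and the resulting ILookup is cached for all later calls.
        index = new Lazy<ILookup<string, long>>(() =>
            (from kvp in dic
             from word in kvp.Value.Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries)
             select new { Word = word, Code = long.Parse(kvp.Key) })
            .ToLookup(wc => wc.Word, wc => wc.Code, StringComparer.OrdinalIgnoreCase));
    }

    public long[] Find(string keyword) => index.Value[keyword].ToArray();
}
```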
Upvotes: 5