Reputation: 7067
I have a very large dictionary, and its contents look like this (headers not included in the dictionary):
(code)      (names)
------------------------------
910235487   Diabetes, tumors, sugar sick, .....
I have more than 150K such pairs in the dictionary.
The user input is keywords (diagnosis names), so I cannot search the dictionary by its keys.
Here is the code:
var relevantIDs = this.dic.Where(ele => ele.Value.Contains(keyword))
.Select(n => Convert.ToUInt64(n.Key));
The dictionary is a Dictionary<string, string>,
and I have to use string as the key type because the codes can sometimes contain characters. The names column contains a list of relevant diagnosis names, so I cannot change this data type either.
I think the problem is that I run the Contains operation on the value of every pair, which slows down the whole process, but I cannot find an alternative way to do it...
This is what I did in order to find the matched codes.
But the performance of this code is terrible (it takes around 5 minutes to finish this single line of code).
Can someone help?
Update and simplest solution
I finally found the reason why the search was so slow, and solved it like this:
var relevantStringIDs = this.dic.Where(ele => ele.Value.Contains(keyword)).ToList();
var relevantUlongIDs = relevantStringIDs.Select(n => Convert.ToUInt64(n.Key)).ToList();
The reason it was that slow is that this.dic.Where(ele => ele.Value.Contains(keyword))
is executed again every time the second part of the query is enumerated (this is a feature of IEnumerable<T>
called deferred execution). So I use ToList()
to convert the deferred query into a concrete list in memory, so that the result can be reused when converting the strings to ulongs
rather than executing the query again for each conversion.
Please correct me if you find something wrong in this explanation.
By the way, although this may not be the best solution, the performance of the changed code is quite satisfactory. The first statement now costs only 169 ms, which is quick enough for me.
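To see the deferred execution at work, here is a minimal sketch (the sample data and the evaluations counter are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var dic = new Dictionary<string, string>
        {
            { "910235487", "Diabetes, tumors, sugar sick" },
            { "120391052", "Fever, diabetes" }
        };

        int evaluations = 0;
        // Building the query runs nothing yet; the predicate fires only on enumeration.
        var query = dic.Where(ele => { evaluations++; return ele.Value.Contains("sick"); });

        // Every enumeration of the deferred query re-scans the whole dictionary...
        var first = query.ToList();
        var second = query.ToList();
        Console.WriteLine(evaluations); // prints 4: two pairs scanned per enumeration

        // ...whereas the materialized list can be reused without re-running the predicate.
        var ids = first.Select(n => Convert.ToUInt64(n.Key)).ToList();
    }
}
```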
Upvotes: 3
Views: 4506
Reputation: 17755
Your problem is that you lose all of the speed benefits of the dictionary by iterating over its values. Dictionaries are optimized for key lookups.
I would approach this with a different data type, optimized for your keyword lookups.
Here's an example using LINQ to create a Lookup from data similar to yours. In this case, I'm building it directly from string data which avoids the dictionary entirely.
This type of lookup should perform much better.
string[] lines = {
    "123 A, B, C, D",
    "456 E, F, G",
    "321 A, E, H, I",
    "654 B, G",
    "789 A, J, K, L",
    "987 A, M, L, E"
};
var lookup = lines.SelectMany(
    // split off everything after the code, then break it into trimmed lowercase keywords
    l => l.Split(new char[] { ' ' }, 2)[1].Split(',').Select(v => v.Trim().ToLower()).ToArray(),
    (l, o) => new
    {
        keyword = o,
        code = Convert.ToInt64(l.Split(' ')[0])
    }).ToLookup(k => k.keyword, v => v.code);
Console.WriteLine(String.Join(",", lookup["a"]));
Console.WriteLine(String.Join(",", lookup["l"]));
Console.WriteLine(String.Join(",", lookup["b"]));
Note that this assumes you are looking up whole keywords (your initial example could look up partial keywords).
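If partial keywords must still be supported, one possible compromise (my sketch, not tested against your data) is to run Contains over the lookup's distinct keys, which are far fewer than 150K pairs:

```csharp
// Hypothetical partial-match search over the lookup built above:
// scans only the distinct keywords, not every value in the source data.
string partial = "diab"; // example user input (assumed)
var codes = lookup
    .Where(g => g.Key.Contains(partial)) // substring test per distinct keyword
    .SelectMany(g => g)                  // flatten matching groups into codes
    .Distinct()
    .ToArray();
```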
Upvotes: 2
Reputation: 54877
You're doing it the wrong way round. Dictionaries permit efficient lookups when you know the key, not the value.
A simple way of fixing performance would be to construct a reverse dictionary mimicking a full-text index over your content:
var dic = new Dictionary<string, string>();
dic.Add("910235487", "Diabetes, tumors, sugar sick");
dic.Add("120391052", "Fever, diabetes");
char[] delimiters = new char[] { ' ', ',' };
var wordCodes =
    from kvp in dic
    from word in kvp.Value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
    let code = long.Parse(kvp.Key)
    select new { Word = word, Code = code };
var fullTextIndex =
    wordCodes.ToLookup(wc => wc.Word, wc => wc.Code, StringComparer.OrdinalIgnoreCase);
long[] test1 = fullTextIndex["sugar"].ToArray(); // Gives 910235487
long[] test2 = fullTextIndex["diabetes"].ToArray(); // Gives 910235487, 120391052
The construction of the full-text index will take a long time; however, this is a one-off cost, and will be amortized through the time savings of subsequent lookups.
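One way to keep that one-off cost out of startup is to build the index on first use; here is a sketch using Lazy&lt;T&gt; (the wrapper class and its names are my own invention, not part of the solution above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DiagnosisSearch
{
    private readonly Lazy<ILookup<string, long>> index;

    public DiagnosisSearch(Dictionary<string, string> dic)
    {
        // The expensive index build runs once, on the first lookup,
        // and the resulting ILookup is cached for all later calls.
        index = new Lazy<ILookup<string, long>>(() =>
            (from kvp in dic
             from word in kvp.Value.Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries)
             select new { Word = word, Code = long.Parse(kvp.Key) })
            .ToLookup(wc => wc.Word, wc => wc.Code, StringComparer.OrdinalIgnoreCase));
    }

    public long[] Find(string keyword) => index.Value[keyword].ToArray();
}
```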
Upvotes: 5