Reputation: 20119
This sounds like a simple question, but I don't know how to search for its answer.
I have a trie implementation in C# that will store about 80K words from a dictionary file. It takes quite a while to load all these words (more than 5 mins). I was wondering, what is the best way to "persist" this data so I don't have to reload all the words every time I start the application?
Thanks.
Upvotes: 1
Views: 4482
Reputation: 40649
I would just serialize it in the old MFC binary fashion. That way the reading and writing are about as fast as they can be, and the only work you're left with is allocating and initializing the structure on input, which you need to do anyway.
That is, to serialize a node of the trie, you do this:
    Read/Write the number N of subnodes
    For each subnode
        If reading, allocate a subnode in this node
        Read/Write the character for the subnode
        Serialize the subnode
    End
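In C# that recursive scheme might look like the following. This is a minimal sketch, assuming a hypothetical TrieNode with a character, an end-of-word flag and a child list; your actual node type will differ.

    // Hypothetical node type; only the recursive read/write pattern matters.
    using System.Collections.Generic;
    using System.IO;

    class TrieNode
    {
        public char Char;
        public bool IsWord;
        public List<TrieNode> Children = new List<TrieNode>();

        public void Write(BinaryWriter w)
        {
            w.Write(IsWord);
            w.Write(Children.Count);          // number N of subnodes
            foreach (var child in Children)
            {
                w.Write(child.Char);          // the character for the subnode
                child.Write(w);               // serialize the subnode
            }
        }

        public void Read(BinaryReader r)
        {
            IsWord = r.ReadBoolean();
            int n = r.ReadInt32();
            for (int i = 0; i < n; i++)
            {
                var child = new TrieNode();   // allocate a subnode in this node
                child.Char = r.ReadChar();    // the character for the subnode
                child.Read(r);                // deserialize the subnode
                Children.Add(child);
            }
        }
    }

Recursion depth is bounded by the longest word, so stack overflow isn't a concern for a dictionary trie.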
Edit: Just re-read your question: you want to build the trie from scratch from the word list? As others said, profile, but not with just any old profiler; they don't all find your problem. Here's what I do. The whole thing should not take much more than the time it takes to read the file plus the time it takes to create the structure.
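As a rough sanity check of that baseline, you can time the two halves separately with a Stopwatch. A sketch, where Trie and Add stand in for your own types:

    using System;
    using System.Diagnostics;
    using System.IO;

    class Program
    {
        static void Main()
        {
            var sw = Stopwatch.StartNew();
            string[] words = File.ReadAllLines("dictionary.txt"); // pure file I/O
            Console.WriteLine("Read:  " + sw.ElapsedMilliseconds + " ms");

            sw.Restart();
            var trie = new Trie();       // stand-in for your trie type
            foreach (var word in words)
                trie.Add(word);          // pure structure-building cost
            Console.WriteLine("Build: " + sw.ElapsedMilliseconds + " ms");
        }
    }

If the build phase dominates by a wide margin, the time is going into allocation or link-forming, not I/O, and that's where to dig.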
Upvotes: 0
Reputation: 113402
Like all other performance issues, the ideal solution will follow from profiling your current solution and any other candidate solutions you come up with. Where's the bottleneck? The I/O? Lexing the text? Forming the links in the trie? It will be hard to make a concrete suggestion without knowing your performance goals, the nature of the trie usage, and the bottlenecks currently present.
One strategy to consider: create and persist a 'most common words' dictionary with the 1,000 or so most frequently used words. Load these words into the trie on start-up, and spawn the loading of the full dictionary on another thread, incrementally adding to the existing trie as new words are read.
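A minimal sketch of that strategy, assuming a hypothetical Trie whose Add calls are guarded by a lock so the background loader and searches don't collide; the file names are placeholders:

    using System.IO;
    using System.Threading.Tasks;

    class DictionaryLoader
    {
        private readonly Trie _trie = new Trie();   // stand-in for your trie type
        private readonly object _gate = new object();

        public Trie Load()
        {
            // Load the small 'most common words' file synchronously.
            foreach (var word in File.ReadLines("common1000.txt"))
                lock (_gate) _trie.Add(word);

            // Stream the full dictionary in on a background thread,
            // adding to the same trie as new words are read.
            Task.Run(() =>
            {
                foreach (var word in File.ReadLines("fulldict.txt"))
                    lock (_gate) _trie.Add(word);
            });

            return _trie;   // usable immediately; fills up in the background
        }
    }

The application becomes responsive almost instantly, at the cost of occasionally missing a rare word during the first few seconds.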
Upvotes: 5
Reputation: 28698
I recently refactored a similar data structure, due to slow performance and slow serialization/deserialization times.
My solution was to scrap the trie altogether and go with native .NET collections - Dictionaries and Lookups.
I'm working with about 400k words. From memory it takes about 5 seconds to build the data structure, which is a list of objects indexed by a number of dictionaries and lookups.
The top level is a Dictionary<int, Lookup<string, string>>, where the key is n, the number of letters in the search term. Each Lookup<string, string> has keys that are strings of n letters, and the values are all strings that start with that key; e.g. for the key 'st' the values might be 'start', 'stop' and 'string'.
To create the data structure I simply iterate over the entire list of words for i = 1 to maxLength, creating a Lookup of all distinct 'starts with' strings for each i. Plug those into the top-level dictionary and you're done.
This removes the need for a custom-built trie. I found the performance difference (search time) to be negligible, but the loading speed hugely favours my design (not to mention the simplicity and maintainability of using plain .NET types).
Upvotes: 2