thangnv
thangnv

Reputation: 83

Efficient way to look up dictionary

I have a dictionary, it map word to id, like:

at: 0
hello: 1
school: 2
fortune:3
high:4
we: 5
eat: 6
....
high_school: 17
fortune_cookie: 18
....

Then, i have a document. What is the fastest and efficient way to transfer content of document to id. Eg:

"At high school, we eat fortune cookie."
=>  "0 17, 5 6 18"

Hope to see your suggestion. Thank for readinng.

Upvotes: 0

Views: 4637

Answers (2)

Cybercartel
Cybercartel

Reputation: 12592

You can try a trie data structure or a red-black tree if the document hasn't so much duplicates. A trie is much less expensive. You can also combine a trie with a wildcard: http://phpir.com/tries-and-wildcards

Upvotes: 3

Jim Mischel
Jim Mischel

Reputation: 134035

It really depends on how large your document is, whether your keyword list is static, and whether you need to find multi-word phrases. The naive way to do it is to look up every word from the document in the dictionary. Because dictionary lookups are O(1), looking up every word will take O(n) time, where n is the number of words in the document. If you need to find multi-word phrases, you can post-process the output to find those.

That's not the most efficient way to do things, but it's really easy to implement, reasonably fast, and will work very well if your documents aren't huge.

If you have very large documents, then you probably want something like the Aho-Corasick string matching algorithm. That algorithm works in two stages. First it builds a trie from the words in your dictionary, and then it makes a single pass through the document and outputs all of the matches. It's more complicated to implement than the naive method, but it works very well once the trie is built. And, truth to tell, it's not that hard to implement. The original paper, which is linked from the Wikipedia article, explains the algorithm well and it's not difficult to convert their pseudocode into a working program.

Note, however, that you might get some unexpected results. For example, if your dictionary contains the words "high" and "school" as well as the two-word phrase "high school", the Aho-Corasick will give you matches for all three when it sees the phrase "high school".

Upvotes: 3

Related Questions