Reputation: 269
I have a dictionary containing a list of words, and I have a URL string. I want to find all the dictionary words contained in the URL after it has been decomposed into tokens using delimiters. Right now, I test each word in the dictionary against each token longer than a certain length (using Java's String.contains() method). For example, for www.wunderground.com I search the token "wunderground" for words like "ground".
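Here is roughly what I am doing now (a simplified sketch; the class and variable names are just for illustration):

```java
import java.util.ArrayList;
import java.util.List;

class NaiveMatcher {
    // For each URL token, scan the entire dictionary and test each
    // word with String.contains().
    static List<String> findWords(List<String> dictionary, List<String> tokens, int minTokenLength) {
        List<String> matches = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minTokenLength) continue; // skip short tokens
            for (String word : dictionary) {               // scans every dictionary entry per token
                if (token.contains(word)) {                // e.g. "wunderground".contains("ground")
                    matches.add(word);
                }
            }
        }
        return matches;
    }
}
```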
I am sure there is a more efficient way of doing that. Any ideas?
Upvotes: 0
Views: 492
Reputation: 13357
If you load your dictionary into a HashSet (or a HashMap, if you need to associate values with the words), you can test each candidate substring ("wunderground", "underground", "nderground", "derground", ..., "wundergroun", ..., "under", ..., "ground", ...) very quickly, specifically in O(1) time per lookup.
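A minimal sketch of that idea (I'm using a HashSet here since you only need membership tests; the names and the minimum length are illustrative):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashMatcher {
    static final int MIN_WORD_LENGTH = 5; // prune candidates shorter than a typical dictionary word

    // Generate every substring of the token with length >= MIN_WORD_LENGTH
    // and test each one against the dictionary set; each lookup is O(1).
    static List<String> findWords(Set<String> dictionary, String token) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int end = start + MIN_WORD_LENGTH; end <= token.length(); end++) {
                String candidate = token.substring(start, end);
                if (dictionary.contains(candidate)) { // cost independent of dictionary size
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }
}
```

For example, with a dictionary of {"under", "ground", "wonder"}, findWords(dictionary, "wunderground") returns ["under", "ground"].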
To gauge efficiency, figure out how many steps each approach has to do. We'll estimate the big-O complexity.
Your current algorithm has to loop through the entire dictionary: work proportional to the dictionary size, D entries. For each dictionary word, it calls contains(): work proportional to the length of the URL term, C chars, minus the average dictionary word length, which we'll call 5. So that takes on the order of D * (C - 5) steps, i.e. O(D * (C - 5)), for each term in the URL.
After you build a hash table, the cost of a lookup is independent of the number of entries. Each URL term of C chars has about C² substrings. If you prune that to substrings of at least 5 chars, that's (C - 5)² substrings. [Well, technically it's (C - 5) * (C - 4) / 2, but we're figuring asymptotic complexity, which is the big-picture approximation.] So the cost to look them all up in the dictionary is on the order of (C - 5)² steps. Again, that's for each term in the URL, and it's independent of the dictionary size.
Let's say your dictionary has 10,000 entries and the average URL term is 10 chars long. Then the old algorithm takes 50,000 steps per URL term while the hash algorithm takes 25 steps per URL term. Make sense?
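To double-check that arithmetic (a throwaway snippet; the numbers come straight from the estimates above):

```java
public class StepEstimate {
    public static void main(String[] args) {
        int D = 10_000; // dictionary entries
        int C = 10;     // average URL term length, in chars
        int W = 5;      // assumed average dictionary word length
        System.out.println("linear scan:  " + D * (C - W) + " steps per term");       // 50000
        System.out.println("hash lookups: " + (C - W) * (C - W) + " steps per term"); // 25
    }
}
```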
Upvotes: 1