Eduardo

Reputation: 7141

Optimising checking for Strings in a word list (Java)

I have a text file containing ~30,000 words in alphabetical order each on a separate line. I also have a Set<String> set containing ~10 words.

I want to check if any of the words in my set are in the word list (text file).

So far my method has been to:

  1. Open the word list text file
  2. Read a line/word
  3. Check if set contains that word
  4. Repeat to the end of the word list file
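
The steps above can be sketched as follows (a minimal version of the linear scan; class and method names are made up):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class LinearScan {
    // Returns the words from 'queries' that appear in the word-list file.
    static Set<String> findMatches(String wordListPath, Set<String> queries) throws IOException {
        Set<String> matches = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(wordListPath))) {
            String line;
            while ((line = reader.readLine()) != null) { // step 2: read a line/word
                if (queries.contains(line)) {            // step 3: check the set
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```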

This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a, c, d, etc.

My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?


Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.

Upvotes: 1

Views: 290

Answers (4)

nikhilsharmaNS

Reputation: 71

You can extend your approach of using hash sets to the entire wordlist file. String comparisons are expensive, so it's better to create a HashSet of Integer. Read the wordlist once in its entirety (assuming the word count will not grow from 30,000 to something like 3 million) and save all the words in an Integer HashSet. When adding to the Integer HashSet, use:

wordListHashSet.add(mycurrentword.hashCode());

You mentioned that you have a String set of 10 words that must be checked against the wordlist. Again, instead of a String set, create an Integer HashSet and create an iterator over it:

Iterator<Integer> it = myTenWordsHashSet.iterator();

Iterate over this in a loop and check for the following condition:

wordListHashSet.contains(it.next());

If this is true, then you have the word in the wordlist.
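
Putting the pieces together, a minimal sketch of this approach might look like the following (note that two different strings can share a hash code, so a hit can, rarely, be a false positive; verify against the real strings if exact answers are required):

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class IntHashLookup {
    // Returns true if any of 'tenWords' has a hash code present in 'wordList'.
    static boolean anyWordInList(List<String> wordList, Set<String> tenWords) {
        // Store each word's hash code instead of the word itself.
        Set<Integer> wordListHashSet = new HashSet<>();
        for (String w : wordList) {
            wordListHashSet.add(w.hashCode());
        }
        // Same for the 10-word query set.
        Set<Integer> myTenWordsHashSet = new HashSet<>();
        for (String w : tenWords) {
            myTenWordsHashSet.add(w.hashCode());
        }
        Iterator<Integer> it = myTenWordsHashSet.iterator();
        while (it.hasNext()) {
            if (wordListHashSet.contains(it.next())) {
                return true; // hash match (beware of rare collisions)
            }
        }
        return false;
    }
}
```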

Using Integer hash sets is a good idea when performance is what you are looking for. Internally, Java stores the elements by their hash, so repeated lookups are fast: roughly O(1) per contains() call, compared with O(log n) for a binary search over the wordlist.

Hope that helps!

Upvotes: 2

Kostas Kryptos

Reputation: 4111

You can make some improvements depending on your needs.

If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file once into another Set (a HashSet). Now you just need to search for a match in this new Set, so each lookup will be O(1).
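
A minimal sketch of that idea (the class and path names are made up; the file is loaded once, then reused across queries):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PreloadedDictionary {
    private final Set<String> dictionary;

    // Load the word list once, e.g. at application start-up.
    public PreloadedDictionary(String wordListPath) throws IOException {
        dictionary = new HashSet<>(Files.readAllLines(Paths.get(wordListPath)));
    }

    // O(1) lookup per word; the 10-word set can change freely between calls.
    public boolean anyContained(Set<String> words) {
        for (String w : words) {
            if (dictionary.contains(w)) {
                return true;
            }
        }
        return false;
    }
}
```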

Upvotes: 0

AllTradesJack

Reputation: 2882

You might consider iterating through your 10-word set (maybe parse it from the file into an array), and for each entry, using a binary search to see if it's contained in the larger list. Binary search only takes O(log N) steps, and log₂(30,000) is about 15, which is significantly faster than 30,000 steps.

Since you'll repeat this step once for every word in your set, it should take about 10 × log(30k) steps in total.
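
A minimal sketch of this, using the standard library's binary search (names are made up; the word list must already be sorted, as the file is):

```java
import java.util.Collections;
import java.util.List;

public class BinarySearchLookup {
    // 'sortedWordList' must be in alphabetical order.
    static boolean anyContained(List<String> sortedWordList, List<String> queries) {
        for (String q : queries) {
            // Collections.binarySearch returns a non-negative index on a hit.
            if (Collections.binarySearch(sortedWordList, q) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```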

Upvotes: 0

Erich Kitzmueller

Reputation: 36987

It's probably not worth the hassle for 30,000 words, but let's say you have a lot more, like 300,000,000 words, and still only 10 words to look for.

In that case, you could do a binary search in the large file for each of the search words, using a RandomAccessFile. Each search step would first require you to find the beginning of a word (or of the next word, implementation-dependent), which makes it considerably more difficult, and handling all the corner cases exceeds the amount of code one could provide here. But it can still be done and would surely be faster than reading through all 300,000,000 words.
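
As a rough sketch of the idea only (it assumes a sorted, newline-separated ASCII file, skips to the next line boundary after each seek, and falls back to a short linear scan when the remaining range contains no full line; the corner cases mentioned above are glossed over):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBinarySearch {
    // Binary search over a sorted, '\n'-separated ASCII word file.
    static boolean contains(RandomAccessFile file, String target) throws IOException {
        long low = 0, high = file.length(); // byte range still in play, aligned to lines
        while (low < high) {
            long mid = (low + high) / 2;
            file.seek(mid);
            file.readLine();                      // skip the partial line at 'mid'
            long lineStart = file.getFilePointer();
            if (lineStart >= high) {              // no full line left: scan from 'low'
                file.seek(low);
                String w;
                while (file.getFilePointer() < high && (w = file.readLine()) != null) {
                    if (w.equals(target)) return true;
                }
                return false;
            }
            String word = file.readLine();        // first full word after 'mid'
            int cmp = word.compareTo(target);
            if (cmp == 0) return true;
            if (cmp < 0) low = file.getFilePointer(); // target is after this word
            else high = lineStart;                    // target is before this word
        }
        return false;
    }
}
```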

Upvotes: 1
