16dots
16dots

Reputation: 3070

A more efficient way of finding English words that are one letter off from eachother

I wrote a little program that tries to find a connection between two equal length English words. Word A will transform into Word B by changing one letter at a time, each newly created word has to be an English word.

For example:

Word A = BANG
Word B = DUST

Result:

BANG -> BUNG ->BUNT -> DUNT -> DUST

My process:

  1. Load an English wordlist(consist of 109582 words) into a Map<Integer, List<String>> _wordMap = new HashMap();, key will be the word length.

  2. User put in 2 words.

  3. createGraph creates a graph.

  4. calculate the shortest path between those 2 nodes

  5. prints out the result.

Everything works perfectly fine, but I am not satisfied with the time it took in step 3.

See:

Completely loaded 109582 words!
CreateMap took: 30 milsecs
CreateGraph took: 17417 milsecs
(HOISE : HORSE)
(HOISE : POISE)
(POISE : PRISE)
(ARISE : PRISE)
(ANISE : ARISE)
(ANILE : ANISE)
(ANILE : ANKLE)
The wholething took: 17866 milsecs

I am not satisfied with the time it takes create the graph in step 3, here's my code for it(I am using JgraphT for the graph):

private List<String> _wordList = new ArrayList();  // list of all 109582 English words
private Map<Integer, List<String>> _wordMap = new HashMap();  // Map grouping all the words by their length()
private UndirectedGraph<String, DefaultEdge> _wordGraph =
        new SimpleGraph<String, DefaultEdge>(DefaultEdge.class);  // Graph used to calculate the shortest path from one node to the other.


private void createGraph(int wordLength){

    long before = System.currentTimeMillis();
    List<String> words = _wordMap.get(wordLength);
    for(String word:words){
        _wordGraph.addVertex(word);  // adds a node
        for(String wordToTest : _wordList){
            if (isSimilar(word, wordToTest)) {        
                _wordGraph.addVertex(wordToTest);  // adds another node
                _wordGraph.addEdge(word, wordToTest);  // connecting 2 nodes if they are one letter off from eachother
            }
        }            
    }        

    System.out.println("CreateGraph took: " + (System.currentTimeMillis() - before)+ " milsecs");
}


private boolean isSimilar(String wordA, String wordB) {
    if(wordA.length() != wordB.length()){
        return false;
    }        

    int matchingLetters = 0;
    if (wordA.equalsIgnoreCase(wordB)) {
        return false;
    }
    for (int i = 0; i < wordA.length(); i++) {

        if (wordA.charAt(i) == wordB.charAt(i)) {
            matchingLetters++;
        }
    }
    if (matchingLetters == wordA.length() - 1) {
        return true;
    }
    return false;
}

My question:

How can I improve my algorithm inorder to speed up the process?

For any redditors that are reading this, yes I created this after seeing the thread from /r/askreddit yesterday.

Upvotes: 9

Views: 2076

Answers (3)

Joop Eggen
Joop Eggen

Reputation: 109567

You can have the list of words of same length sorted, and then have a loop nesting of the kind for (int i = 0; i < n; ++i) for (int j = i + 1; j < n; ++j) { }.

And in isSimilar count the differences and on 2 return false.

Upvotes: 0

DNA
DNA

Reputation: 42607

You probably need to run it through a profiler to see where most of the time is taken, especially since you are using library classes - otherwise you might put in a lot of effort but see no significant improvement.

You could lowercase all the words before you start, to avoid the equalsIgnoreCase() on every comparison. In fact, this is an inconsistency in your code - you use equalsIgnoreCase() initially, but then compare chars in a case-sensitive way: if (wordA.charAt(i) == wordB.charAt(i)). It might be worth eliminating the equalsIgnoreCase() check entirely, since this is doing essentially the same thing as the following charAt loop.

You could change the comparison loop so it finishes early when it finds more than one different letter, rather than comparing all the letters and only then checking how many are matching or different.

(Update: this answer is about optimizing your current code. I realize, reading your question again, that you may be asking about alternative algorithms!)

Upvotes: 0

Jon Skeet
Jon Skeet

Reputation: 1500805

Here's a starting thought:

Create a Map<String, List<String>> (or a Multimap<String, String> if you've using Guava), and for each word, "blank out" one letter at a time, and add the original word to the list for that blanked out word. So you'd end up with:

.ORSE => NORSE, HORSE, GORSE (etc)
H.RSE => HORSE
HO.SE => HORSE, HOUSE (etc)

At that point, given a word, you can very easily find all the words it's similar to - just go through the same process again, but instead of adding to the map, just fetch all the values for each "blanked out" version.

Upvotes: 18

Related Questions