Richard
Richard

Reputation: 968

How to create a Boggle Board from a list of Words? (reverse Boggle solver!)

I am trying to solve the reverse Boggle problem. Simply put, given a list of words, come up with a 4x4 grid of letters in which as many words in the list can be found in sequences of adjacent letters (letters are adjacent both orthogonally and diagonally).

I DO NOT want to take a known board and solve it. That is an easy TRIE problem and has been discussed/solved to death here for people's CS projects.

Example word list:

margays, jaguars, cougars, tomcats, margay, jaguar, cougar, pumas, puma, toms

Solution:

ATJY
CTSA
OMGS
PUAR

This problem is HARD (for me). Algorithm I have so far:

  1. For each word in the input, make a list of all possible ways it can legally be appear on the board by itself.
  2. Try all possible combinations of placing word #2 on those boards and keep the ones that have no conflicts.
  3. Repeat till end of list.
  4. ...
  5. Profit!!! (for those that read /.)

Obviously, there are implementation details. Start with the longest word first. Ignore words that are substrings of other words.

I can generate all 68k possible boards for a 7 character word in around 0.4 seconds. Then when I add an additional 7 character board, I need to compare 68k x 68k boards x 7 comparisons. Solve time becomes glacial.

There must be a better way to do this!!!!

Some code:

BOARD_SIDE_LENGTH = 4

class Board:
    def __init__(self):
        pass

    def setup(self, word, start_position):
        self.word = word
        self.indexSequence = [start_position,]
        self.letters_left_over = word[1:]
        self.overlay = []
        # set up template for overlay.  When we compare boards, we will add to this if the board fits
        for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
            self.overlay.append('')
        self.overlay[start_position] = word[0]
        self.overlay_count = 0

    @classmethod
    def copy(boardClass, board):
        newBoard = boardClass()
        newBoard.word = board.word
        newBoard.indexSequence = board.indexSequence[:]
        newBoard.letters_left_over = board.letters_left_over
        newBoard.overlay = board.overlay[:]
        newBoard.overlay_count = board.overlay_count
        return newBoard

    # need to check if otherboard will fit into existing board (allowed to use blank spaces!)
    # otherBoard will always be just a single word
    @classmethod
    def testOverlay(self, this_board, otherBoard):
        for pos in otherBoard.indexSequence:
            this_board_letter = this_board.overlay[pos]
            other_board_letter = otherBoard.overlay[pos]
            if this_board_letter == '' or other_board_letter == '':
                continue
            elif this_board_letter == other_board_letter:
                continue
            else:
                return False
        return True

    @classmethod
    def doOverlay(self, this_board, otherBoard):
        # otherBoard will always be just a single word
        for pos in otherBoard.indexSequence:
            this_board.overlay[pos] = otherBoard.overlay[pos]
        this_board.overlay_count = this_board.overlay_count + 1

    @classmethod
    def newFromBoard(boardClass, board, next_position):
        newBoard = boardClass()
        newBoard.indexSequence = board.indexSequence + [next_position]
        newBoard.word = board.word
        newBoard.overlay = board.overlay[:]
        newBoard.overlay[next_position] = board.letters_left_over[0]    
        newBoard.letters_left_over = board.letters_left_over[1:]
        newBoard.overlay_count = board.overlay_count
        return newBoard

    def getValidCoordinates(self, board, position):
        row = position / 4
        column = position % 4
        for r in range(row - 1, row + 2):
            for c in range(column - 1, column + 2):
                if r >= 0 and r < BOARD_SIDE_LENGTH and c >= 0 and c < BOARD_SIDE_LENGTH:
                    if (r*BOARD_SIDE_LENGTH+c not in board.indexSequence):
                        yield r, c

class boardgen:
    def __init__(self):
        self.boards = []

    def createAll(self, board):
        # get the next letter
        if len(board.letters_left_over) == 0:
            self.boards.append(board)
            return
        next_letter = board.letters_left_over[0]    
        last_position = board.indexSequence[-1]
        for row, column in board.getValidCoordinates(board, last_position):
            new_board = Board.newFromBoard(board, row*BOARD_SIDE_LENGTH+column)
            self.createAll(new_board)

And use it like this:

words = ['margays', 'jaguars', 'cougars', 'tomcats', 'margay', 'jaguar', 'cougar', 'pumas', 'puma']
words.sort(key=len)

first_word = words.pop()

# generate all boards for the first word
overlaid_boards = []
for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
    test_board = Board()
    test_board.setup(first_word, i)
    generator = boardgen()
    generator.createAll(test_board)
    overlaid_boards += generator.boards

Upvotes: 6

Views: 5908

Answers (2)

Chris J. Kiick
Chris J. Kiick

Reputation: 1505

This is an interesting problem. I can't quite come up with a full, optimized solution, but there here are some ideas you might try.

The hard part is the requirement to find the optimal subset if you can't fit all the words in. That's going to add a lot to the complexity. Start by eliminating word combinations that obviously aren't going to work. Cut any words with >16 letters. Count the number of unique letters needed. Be sure to take into account letters repeated in the same word. For example, if the list includes "eagle" I don't think you are allowed to use the same 'e' for both ends of the word. If your list of needed letters is >16, you have to drop some words. Figuring out which ones to cut first is an interesting sub-problem... I'd start with the words containing the least used letters. It might help to have all sub-lists sorted by score.

Then you can do the trivial cases where the total of word lengths is <16. After that, you start with the full list of words and see if there's a solution for that. If not, figure out which word(s) to drop and try again.

Given a word list then, the core algorithm is to find a grid (if one exists) that contains all of those words.

The dumb brute-force way would be to iterate over all the grids possible with the letters you need, and test each one to see if all your words fit. It's pretty harsh though: middle case is 16! = 2x10exp13 boards. Exact formula for n unique letters is... (16!)/(16-n)! x pow(n, 16-n). Which gives a worst case in the range of 3x10exp16. Not very manageable. Even if you can avoid rotations and flips, that only saves you 1/16 of the search space.

A somewhat smarter greedy algorithm would be to sort the words by some criteria, like difficulty or length. A recursive solution would be to take the top word remaining on the list, and attempt to place it on the grid. Then recurse with that grid and the remaining word list. If you fill up the grid before you run out of words, then you have to back track and try another way of placing the word. A greedier approach would be to try placements that re-use the most letters first. You can do some pruning. If at any point the number of spaces left in the grid is less than the remaining set of unique letters needed, then you can eliminate those sub-trees. There are a few other cases where it's obvious there's no solution that can be cut, especially when the remaining grid spaces are < the length of the last word. The search space for this depends on word lengths and how many letters are re-used. I'm sure it's better than brute-force, but I don't know if it's enough to make the problem reasonable.

The smart way would be to use some form of dynamic programming. I can't quite see the complete algorithm for this. One idea is to have a tree or graph of the letters, connecting each letter to "adjacent" letters in the word list. Then you start with the most-connected letter and try to map the tree onto the grid. Always place the letter that completes the most of the word list. There'd have to be some way of handling the case of multiple of the same letter in the grid. And I'm not sure how to order it so you don't have to search every combination.

The best thing would be to have a dynamic algorithm that also included all the sub word lists. So if the list had "fog" and "fox", and fox doesn't fit but fog does, it would be able to handle that without having to run the whole thing on both versions of the list. That's adding complexity because now you have to rank each solution by score as you go. But in the cases where all the words won't fit it would save a lot of time.

Good luck on this.

Upvotes: 2

mcdowella
mcdowella

Reputation: 19601

There are a couple of general ideas for speeding up backtrack search you could try:

1) Early checks. It usually helps to discard partial solutions that can never work as early as possible, even at the cost of more work. Consider all two-character sequences produced by chopping up the words you are trying to fit in - e.g. PUMAS contributes PU, UM, MA, and AS. These must all be present in the final answer. If a partial solution does not have enough overlapped two-character spaces free to contain all of the overlapped two-character sequences it does not yet have, then it cannot be extended to a final answer.

2) Symmetries. I think this is probably most useful if you are trying to prove that there is no solution. Given one way of filling in a board, you can rotate and reflect that solution to find other ways of filling in a board. If you have 68K starting points and one starting point is a rotation or reflection of another starting point, you don't need to try both, because if you can (or could) solve the problem from one starting point you can get the answer from the other starting point by rotating or reflecting the board. So you might be able to divide the number of starting points you need to try by some integer.

This problem is not the only one to have a large number of alternatives at each stage. This also affects the traveling salesman problem. If you can accept not having a guarantee that you will find the absolute best answer, you could try not following up the least promising of these 68k choices. You need some sort of score to decide which to keep - you might wish to keep those which use as many letters already in place as possible. Some programs for the traveling salesman problems discard unpromising links between nodes very early. A more general approach which discards partial solutions rather than doing a full depth first search or branch and bound is Limited Discrepancy Search - see for example http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2426.

Of course some approaches to the TSP discard tree search completely in favor of some sort of hill-climbing approach. You might start off with a filled boggle square and repeatedly attempt to find your words in it, modifying a few characters in order to force them in, trying to find steps which successively increase the number of words that can be found in the square. The easiest form of hill-climbing is repeated simple hill-climbing from multiple random starts. Another approach is to restart the hill-climbing by randomizing only a portion of the solution so far - since you don't know the best size of portion to randomize you might decide to chose the size of portion to randomize at random, so that at least some fraction of the time you are randomizing the correct size of region to produce a new square to start from. Genetic algorithms and simulated annealing are very popular here. A paper on a new idea, Late Acceptance Hill-Climbing, also describes some of its competitors - http://www.cs.nott.ac.uk/~yxb/LAHC/LAHC-TR.pdf

Upvotes: 0

Related Questions