Optimize Word Generator

Question

I am trying to build a program capable of finding the best word in a Scrabble game. In the following code, I am trying to create a list of all the possible words given a set of 7 characters.

import csv
import itertools

with open('dictionary.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)


def FindLegalWords(data):
    LegalWords = []
    for i in data:
        if len(i[0]) <= 15:
            LegalWords.append(i[0])
    return LegalWords


PossibleWords = []


def word_generator(chars, start_with, min_len, max_len):
    for i in range(min_len - 1, max_len):
        for s in itertools.product(chars, repeat=i):
            yield start_with + ''.join(s)


for word in word_generator('abcdefg', '', 2, 15):
    if word in FindLegalWords(data):
        PossibleWords.append(word)

I think it is clear that the code above will take days to find all the possible words. What would be a better approach to the problem? Personally, I thought of turning each word into a number and using NumPy to manipulate them, because I have heard that NumPy is very quick. Would this solve the problem, or would it not be enough? I will be happy to answer any questions about my code.

Thank you in advance

Jérôme Richard · Accepted Answer

There are about 5,539 billion possibilities (all strings of 1 to 15 characters over a 7-letter alphabet), and code working with strings is generally pretty slow (partly due to Unicode handling and allocations). This is huge. Generating a massive amount of data only to filter most of it out is not efficient. This is an algorithmic problem, so it cannot be fixed with optimized libraries like NumPy.

One solution is to directly generate a much smaller subset of candidates that can still match FindLegalWords. I guess you probably do not want to generate words like "bfddgfbgfgd". Thus, you can generate pronounceable words by concatenating two pronounceable word parts. Doing this is a bit tricky though.

A much better solution is to retrieve the possible words from an existing dictionary. You can find such lists online. There are also dictionaries of pronounceable words that can be retrieved from free password databases. AFAIK, some tools like John the Ripper can generate such a word list, which you can store in a text file and then read from your Python program. Note that since the list can be huge, it is better to compress the file and read it directly from the compressed source.
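As a minimal sketch of reading a word list straight from a compressed file: this assumes a gzip-compressed text file with one word per line. The function name load_words and the file name in the usage note are made up for illustration, and your list may use another format.

```python
import gzip


def load_words(path, max_len=15):
    """Read one word per line from a gzip-compressed text file.

    Assumption: the file is UTF-8 text with one word per line.
    Words longer than max_len are skipped (15 is the Scrabble board width).
    """
    words = set()
    # mode="rt" decompresses on the fly and yields text lines,
    # so the uncompressed file never has to exist on disk.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()
            if word and len(word) <= max_len:
                words.add(word)
    return words
```

You would call it with something like load_words("words.txt.gz"), where the file name is hypothetical. Returning a set keeps later membership tests fast.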

Some notes regarding the update:

Since FindLegalWords(data) is a constant, you can store it so as not to recompute it over and over. You can even compute set(FindLegalWords(data)) to make looking up a word in the result faster. Still, the number of possibilities is the main problem, so this will not be enough.
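A rough sketch of that change, with a tiny in-memory stand-in for the CSV rows so it runs on its own. The sample data and the short maximum length are assumptions made for the demo. The generator here also yields lengths min_len through max_len inclusive, which differs from the range(min_len - 1, max_len) in the question.

```python
import itertools


def find_legal_words(data, max_len=15):
    # Build a set once: membership tests are O(1) on average,
    # instead of scanning a list for every candidate.
    return {row[0] for row in data if row and len(row[0]) <= max_len}


def word_generator(chars, min_len, max_len):
    # Same brute-force idea as in the question.
    for n in range(min_len, max_len + 1):
        for s in itertools.product(chars, repeat=n):
            yield "".join(s)


# Stand-in for the rows read from dictionary.csv (hypothetical sample).
data = [["bad"], ["cab"], ["face"], ["zebra"]]

legal_words = find_legal_words(data)  # computed once, outside the loop

# max_len is kept small here; at 15 the loop is still far too long.
possible_words = [w for w in word_generator("abcdefg", 2, 4) if w in legal_words]
print(possible_words)
```

This removes the repeated list rebuild and the linear search, but the loop still visits every generated string, so it only helps for short lengths.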

PossibleWords can only contain strings that are already in FindLegalWords(data). Thus, you can generate it directly from data rather than using a brute-force approach combined with a check. This should be several orders of magnitude faster if data is small. Otherwise, the main problem will be that PossibleWords will be so big that your RAM will certainly not be big enough to contain it anyway...
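Generating the result directly from the dictionary could look roughly like this. The function names and the sample word list are made up for illustration. The first function mirrors the brute-force code in the question (letters may repeat). The second assumes the usual Scrabble rule that each tile is used at most once, which the question's code does not enforce.

```python
from collections import Counter


def words_from_letters(words, chars, min_len=2, max_len=15):
    # Equivalent to the brute-force check: keep words built only
    # from the given letters, with repetition allowed.
    allowed = set(chars)
    return [w for w in words
            if min_len <= len(w) <= max_len and set(w) <= allowed]


def words_from_rack(words, rack, min_len=2):
    # Assumption: each tile in the rack can be used at most once.
    rack_count = Counter(rack)
    return [w for w in words
            if min_len <= len(w) <= len(rack)
            # Counter subtraction keeps only letters the word needs
            # more often than the rack has; empty means it fits.
            and not (Counter(w) - rack_count)]


# Hypothetical sample standing in for FindLegalWords(data).
words = ["bad", "cab", "face", "added", "zebra"]

print(words_from_letters(words, "abcdefg"))
print(words_from_rack(words, "abcdefg"))
```

Both functions make a single pass over the dictionary, so the cost depends on the number of dictionary words rather than on the number of letter combinations.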
