bizzz
bizzz

Reputation: 1865

Find all words and phrases from one string

Due to subject area (writing on a wall) interesting condition is added - letters cannot change their order, so this is not a question about anagrams.


I saw a long word, written by paint on a wall, and now suddenly I want all possible words and phrases I can get from this word by painting out any combination of letters. Wo r ds, randomly separated by whitespace are OK.
To broaden possible results let's make an assumption, that space is not necessary to separate words.
Edit: Obviously letter order should be maintained (thanks idz for pointing that out). Also, phrases may be meaningless. Here are some examples:

Source word: disestablishment 
paint out:   ^ ^^^    ^^^^ ^^
left:         i   tabl    e    -> i table

or paint out:^^^^^^^^^   ^ ^^
left:                 ish e    -> i she  (spacelessness is ok)

Visual example visual example Hard mode/bonus task: consider possible slight alterations to letters (D <-> B, C <-> O and so on) hard mode (D changed to B)

Please suggest your variants of solving this problem.

Here's my general straightforward approach

It's clear that we'll need an English dictionary to find words.
Our goal is to get words to search for in dictionary.
We need to find all possible letters variations to match them against dictionary: each letter can be itself (1) or painted out (0).
Taking the 'space is not needed to separate words' condition in consideration, to distinguish words we must assume that there might be a space between any two letters (1 - there's a space, 0 - there isn't).

d i s e s t a b l i s h m e n t
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^  - possible whitespace

N = number of letters in source word
N-1 = number of 'might-be spaces'
Any of the N + N - 1 elements can be in two states, so let's treat them as booleans. The number of possible variations is 2^(N + N - 1). Yes, it counts useless variants like pasting a space between to spaces, but I didn't come up with more elegant formula.
Now we need an algorithm to get all possible variations of N+N-1 sequence of booleans (I haven't thought it out yet, but word recursion flows through my mind). Then substitute all 1s with corresponding letters (if index of boolean is odd) or whitespace (even)
and 0s with whitespace (odd) or nothing (even). Then trim leading and trailing whitespace, separate words and search them in dictionary.

I don't like this monstrous approach and hope you will help me find good alternatives.

Upvotes: 1

Views: 2749

Answers (3)

ninjagecko
ninjagecko

Reputation: 91094

#!/usr/bin/python3

from itertools import *
from pprint import pprint as pp

Read in dictionary, remove all 1- and 2-letter words which we never use in the English language:

with open('/usr/share/dict/words') as f:
    english = f.read().splitlines()

english = map(str.lower, english)
english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]

def isWord(word):
    return word in english

Your problem:

def splitwords(word):
    """
        splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
    """
    if word=='':
        yield ()
    for i in range(1,len(word)+1):
        try:
            left,right = word[:i],word[i:]
            if left in english:
                for reading in list(splitwords(right)):
                    yield (left,) + tuple(reading)
            else:
                raise IndexError()
        except IndexError:
            pass

def splitwordsWithDeletions(word):
    masks = product(*[(0,1) for char in word])
    for mask in masks:
        candidate = ''.join(compress(word,mask))
        for reading in splitwords(candidate):
            yield reading

for reading in splitwordsWithDeletions('interesting'):
    print(reading)

Result (takes about 30 seconds):

()                                                                                                                                                                                                                    
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)

Speedup possible perhaps by precalculating which words can be read on each letter, into one bin per letter, and iterating with those pre-calculated to speed things up. I think someone else outlines a solution to that effect.

Upvotes: 1

idz
idz

Reputation: 12988

1) Put your dictionary in a trie or prefix tree

2) For each position in the string find legal words by trie look up; store these

3) Print all combinations of non-overlapping words

This assumes that like the examples in the question you want to maintain the letter order (i.e. you are not interested in anagrams).

Upvotes: 2

ccoakley
ccoakley

Reputation: 3255

There are other places you can find anagram algorithms.

subwords(word):
  if word is empty return
  if word is real word:
    print word
  anagrams(word)
  for each letter in word:
    subwords(word minus letter)

Edit: shoot, you'll want to pass a starting point in for the for loop. Otherwise, you'll be redundantly creating a LOT of calls. Frank minus r minus n is the same as Frank minus n minus r. Putting a starting point can ensure that you get each subset once... Except for repeats due to double letters. Maybe just memoize the results to a hash table before printing? Argh...

Upvotes: 0

Related Questions