Reputation: 32400

Generate a list of English words containing consecutive consonant sounds

Start with this:

[G|C] * [T] *

Write a program that generates this:

Cat
Cut
Cute
City <-- NOTE: this one is wrong, because City has an "ESS" sound at the start.
Caught
...
Gate
Gotti
Gut
...
Kit
Kite
Kate
Kata
Katie

Another Example, This:

[C] * [T] * [N]

Should produce this:

Cotton Kitten

Where should I start my research as I figure out how to write a program/script that does this?

Upvotes: 4

Answers (6)

hashable

Reputation: 3851

You can somewhat do this using the steps I outline. I will outline the algorithm first followed by some (untested and quite possibly broken) java code.

Note: I will be using the apache commons-codec library.

Algorithm:

Use a regular expression to represent your input pattern.
From a lexicon of "valid known words" filter the subset that matches your regular expression. Let's call this the matched subset (MS)
Use the Double Metaphone algorithm to encode these words from MS.
Apply some phonetic filtering to prune MS to your needs.

To illustrate how steps 3 and 4 work, I will first show you the output of the Double Metaphone algorithm on the five words you have suggested as examples: Cute, Cat, Cut, Caught, City

Code A (illustrating Double Metaphone):

private static void doubleMetaphoneTest() {
    org.apache.commons.codec.language.DoubleMetaphone dm = new DoubleMetaphone();
    System.out.println("Cute\t"+dm.encode("Cute"));
    System.out.println("Cat\t"+dm.encode("Cat"));
    System.out.println("Cut\t"+dm.encode("Cut"));
    System.out.println("Caught\t"+dm.encode("Caught"));
    System.out.println("City\t"+dm.encode("City"));
}

Output of code A

Cute   KT
Cat    KT
Cut    KT
Caught KFT
City   ST

Now in your question, you have stated that City is not a right solution because it begins with an "ESS" sound. Double Metaphone will help you to identify exactly this kind of issue (although I am sure there will be cases where it will fail to help). Now you can apply step 4 in the algorithm using this principle.

In the following code, for step 4 (apply some phonetic filtering), I will assume that you already know that you only want the 'K' sound and not the 'S' sound.

Code B (prototype solution to entire question)

Note: This code is meant to illustrate the use of the DoubleMetaphone algorithm for your purpose. I haven't run the code. The regex may be broken or may be a really lame one or my use of Pattern Matcher may be wrong (It's 2AM now). If it is wrong please improve/correct it.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.codec.language.DoubleMetaphone;

public class GenerateWords {

    /**
     * Returns a set of words that conform to the input pattern
     * @param inputPattern a regular expression
     * @param lexicon a list of valid words
     */
    public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon){
        /* E.g. for the case           [C] * [T] * [N] 
         * the regex is:
         *  [Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+
         */
        Pattern p = Pattern.compile(inputPattern);
        List<String> result = new ArrayList<String>();
        for(String aWord:lexicon){
            Matcher m = p.matcher(aWord);
            if(m.matches()){
                result.add(aWord);
            }
        }
        return result; 
    }

    /**
     * Returns the subset of the input list that "phonetically" begins with the character specified.
     * E.g. The word 'cat' begins with 'K' and the word 'city' begins with 'S'
     * @param prefix
     * @param possibleWords
     * @return
     */
    public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords){
        List<String> result = new ArrayList<String>();
        DoubleMetaphone dm = new DoubleMetaphone();
        for(String aWord:possibleWords){
            String phoneticRepresentation = dm.encode(aWord); // this will always return in all caps
            // check if the word begins with the prefix char of interest
            if(phoneticRepresentation.indexOf(0)==Character.toUpperCase(prefix)){
                result.add(aWord);
            }
        }
        return result;
    }

    public static void main(String args[]){

        // I have not implemented this method to read a text file etc.
        List<String> lexicon = readLexiconFromFileIntoList();
        String regex = "[Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+";
        List<String> possibleWords = fetchMatchingWordsFromLexicon(regex,lexicon);

        // your result
        List<String> result = filterWordsBeginningWithMetaphonePrefix('C', possibleWords);

        // print result or whatever

    }
}

Upvotes: 1

Rich

Reputation: 3123

You can do this by using regular expressions against a dictionary containing phonetic versions of words.

Here's an example in Javascript:

     <html>
    <head>
        <title>Test</title>
        <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
        <script>

            $.get('cmudict0.3',function (data) {
                matches = data.match(/^(\S*)\s+K.*\sT.*\sN$/mg);
                $('body').html('<p>'+matches.join('<br/> ')+'</p>');
            })

        </script>
    </head>
    <body>
    </body>
</html>

You'll need to download the list of all words from http://icon.shef.ac.uk/Moby/mpron.tar.Z and put it (uncompressed) in the same folder as the HTML file. I've only translated the [C] * [T] * [N] version into a regular expression and the output isn't very nice but it'll give you the idea. Here's a sample of the output:

CALTON K AE1 L T AH0 N
CAMPTON K AE1 M P T AH0 N
CANTEEN K AE0 N T IY1 N
CANTIN K AA0 N T IY1 N
CANTLIN K AE1 N T L IH0 N
CANTLON K AE1 N T L AH0 N
...
COTTERMAN K AA1 T ER0 M AH0 N
COTTMAN K AA1 T M AH0 N
COTTON K AA1 T AH0 N
COTTON(2) K AO1 T AH0 N
COULSTON K AW1 L S T AH0 N
COUNTDOWN K AW1 N T D AW2 N
..
KITSON K IH1 T S AH0 N
KITTELSON K IH1 T IH0 L S AH0 N
KITTEN K IH1 T AH0 N
KITTERMAN K IH1 T ER0 M AH0 N
KITTLESON K IH1 T L IH0 S AH0 N
...

Upvotes: 4

ferdystschenko

Reputation: 1469

One approach would be to transform an English pronounciation dictionary into a finite state machine and then search it using a regular expression or a simple wildcard mechanism. You could also compile a such a dictionary yourself by running an English word list through a program that produces phonetic transcriptions, e.g. like the ones found on these sites:

Finding a mechanism to map back from the phonetic transcription to standard spelling should be easy.

Upvotes: 2

ablerman

Reputation: 1543

You want moby pronounciation. It's part of the moby words project.

You'll find an explanation and links to the documents here: http://en.wikipedia.org/wiki/Moby_Project

Moby pronounciation is a list of about 170k words and their phonetic pronounciations.

From there it should be a relatively straight forward process to build the program.

Upvotes: 2

Justin Peel

Reputation: 47082

You need a word list or dictionary that uses something like the International Phonetic Alphabet, or some other standard phonetic way of writing words. It would need to have a list of English words and their corresponding phonetic spellings. I have no idea where you would get one because I don't think that the standard dictionary makers just hand out that sort of information.

Upvotes: 4

danben

Reputation: 83270

A phoneme is "the smallest segmental unit of sound employed to form meaningful contrasts between utterances." It is my understanding that this is the basis for pronunciation-based spelling correction systems. Misspelling newspaper as noospaypr might generate the proper correction despite a large edit distance between the two words, because the corresponding segments in each word (oo and ew, pa and pay, per and pr) may be converted to the same phoneme.

Unfortunately, a couple of minutes of me Googling didn't find any libraries that will perform the conversion for english words, but that is where I would start.

Upvotes: 1

Generate a list of English words containing consecutive consonant sounds

Answers (6)

Algorithm:

Code A (illustrating Double Metaphone):

Output of code A

Code B (prototype solution to entire question)

Related Questions