Reputation: 32400
Start with this:
[G|C] * [T] *
Write a program that generates this:
Cat
Cut
Cute
City <-- NOTE: this one is wrong, because City has an "ESS" sound at the start.
Caught
...
Gate
Gotti
Gut
...
Kit
Kite
Kate
Kata
Katie
Another Example, This:
[C] * [T] * [N]
Should produce this:
Cotton Kitten
Where should I start my research as I figure out how to write a program/script that does this?
Upvotes: 4
Views: 1848
Reputation: 3851
You can somewhat do this using the steps I outline. I will outline the algorithm first followed by some (untested and quite possibly broken) java code.
Note: I will be using the apache commons-codec library.
To illustrate how steps 3 and 4 work, I will first show you the output of the Double Metaphone algorithm on the five words you have suggested as examples: Cute, Cat, Cut, Caught, City
private static void doubleMetaphoneTest() {
org.apache.commons.codec.language.DoubleMetaphone dm = new DoubleMetaphone();
System.out.println("Cute\t"+dm.encode("Cute"));
System.out.println("Cat\t"+dm.encode("Cat"));
System.out.println("Cut\t"+dm.encode("Cut"));
System.out.println("Caught\t"+dm.encode("Caught"));
System.out.println("City\t"+dm.encode("City"));
}
Cute KT
Cat KT
Cut KT
Caught KFT
City ST
Now in your question, you have stated that City is not a right solution because it begins with an "ESS" sound. Double Metaphone will help you to identify exactly this kind of issue (although I am sure there will be cases where it will fail to help). Now you can apply step 4 in the algorithm using this principle.
In the following code, for step 4 (apply some phonetic filtering), I will assume that you already know that you only want the 'K' sound and not the 'S' sound.
Note: This code is meant to illustrate the use of the DoubleMetaphone algorithm for your purpose. I haven't run the code. The regex may be broken or may be a really lame one or my use of Pattern Matcher may be wrong (It's 2AM now). If it is wrong please improve/correct it.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.codec.language.DoubleMetaphone;
public class GenerateWords {
/**
* Returns a set of words that conform to the input pattern
* @param inputPattern a regular expression
* @param lexicon a list of valid words
*/
public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon){
/* E.g. for the case [C] * [T] * [N]
* the regex is:
* [Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+
*/
Pattern p = Pattern.compile(inputPattern);
List<String> result = new ArrayList<String>();
for(String aWord:lexicon){
Matcher m = p.matcher(aWord);
if(m.matches()){
result.add(aWord);
}
}
return result;
}
/**
* Returns the subset of the input list that "phonetically" begins with the character specified.
* E.g. The word 'cat' begins with 'K' and the word 'city' begins with 'S'
* @param prefix
* @param possibleWords
* @return
*/
public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords){
List<String> result = new ArrayList<String>();
DoubleMetaphone dm = new DoubleMetaphone();
for(String aWord:possibleWords){
String phoneticRepresentation = dm.encode(aWord); // this will always return in all caps
// check if the word begins with the prefix char of interest
if(phoneticRepresentation.indexOf(0)==Character.toUpperCase(prefix)){
result.add(aWord);
}
}
return result;
}
public static void main(String args[]){
// I have not implemented this method to read a text file etc.
List<String> lexicon = readLexiconFromFileIntoList();
String regex = "[Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+";
List<String> possibleWords = fetchMatchingWordsFromLexicon(regex,lexicon);
// your result
List<String> result = filterWordsBeginningWithMetaphonePrefix('C', possibleWords);
// print result or whatever
}
}
Upvotes: 1
Reputation: 3123
You can do this by using regular expressions against a dictionary containing phonetic versions of words.
Here's an example in Javascript:
<html>
<head>
<title>Test</title>
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
<script>
$.get('cmudict0.3',function (data) {
matches = data.match(/^(\S*)\s+K.*\sT.*\sN$/mg);
$('body').html('<p>'+matches.join('<br/> ')+'</p>');
})
</script>
</head>
<body>
</body>
</html>
You'll need to download the list of all words from http://icon.shef.ac.uk/Moby/mpron.tar.Z and put it (uncompressed) in the same folder as the HTML file. I've only translated the [C] * [T] * [N] version into a regular expression and the output isn't very nice but it'll give you the idea. Here's a sample of the output:
CALTON K AE1 L T AH0 N
CAMPTON K AE1 M P T AH0 N
CANTEEN K AE0 N T IY1 N
CANTIN K AA0 N T IY1 N
CANTLIN K AE1 N T L IH0 N
CANTLON K AE1 N T L AH0 N
...
COTTERMAN K AA1 T ER0 M AH0 N
COTTMAN K AA1 T M AH0 N
COTTON K AA1 T AH0 N
COTTON(2) K AO1 T AH0 N
COULSTON K AW1 L S T AH0 N
COUNTDOWN K AW1 N T D AW2 N
..
KITSON K IH1 T S AH0 N
KITTELSON K IH1 T IH0 L S AH0 N
KITTEN K IH1 T AH0 N
KITTERMAN K IH1 T ER0 M AH0 N
KITTLESON K IH1 T L IH0 S AH0 N
...
Upvotes: 4
Reputation: 1469
One approach would be to transform an English pronounciation dictionary into a finite state machine and then search it using a regular expression or a simple wildcard mechanism. You could also compile a such a dictionary yourself by running an English word list through a program that produces phonetic transcriptions, e.g. like the ones found on these sites:
Finding a mechanism to map back from the phonetic transcription to standard spelling should be easy.
Upvotes: 2
Reputation: 1543
You want moby pronounciation. It's part of the moby words project.
You'll find an explanation and links to the documents here: http://en.wikipedia.org/wiki/Moby_Project
Moby pronounciation is a list of about 170k words and their phonetic pronounciations.
From there it should be a relatively straight forward process to build the program.
Upvotes: 2
Reputation: 47082
You need a word list or dictionary that uses something like the International Phonetic Alphabet, or some other standard phonetic way of writing words. It would need to have a list of English words and their corresponding phonetic spellings. I have no idea where you would get one because I don't think that the standard dictionary makers just hand out that sort of information.
Upvotes: 4
Reputation: 83270
A phoneme is "the smallest segmental unit of sound employed to form meaningful contrasts between utterances." It is my understanding that this is the basis for pronunciation-based spelling correction systems. Misspelling newspaper as noospaypr might generate the proper correction despite a large edit distance between the two words, because the corresponding segments in each word (oo and ew, pa and pay, per and pr) may be converted to the same phoneme.
Unfortunately, a couple of minutes of me Googling didn't find any libraries that will perform the conversion for english words, but that is where I would start.
Upvotes: 1