Reputation: 2858
I have a mapping of 100,000+ words to their phonemes (CMUdict), like:
ABANDONED => [ 'AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D' ]
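For reference, the raw CMUdict entries carry stress digits on the vowels (like 'AH0'); stripping them gives lists like the one above. With NLTK's copy of the dictionary, for example:

```python
# Example only: NLTK ships a copy of CMUdict (may need nltk.download('cmudict') first).
from nltk.corpus import cmudict

pronunciations = cmudict.dict()   # lowercase word -> list of pronunciations

def phonemes(word):
    # take the first pronunciation and drop the stress digits: 'AH0' -> 'AH'
    return [p.rstrip('0123456789') for p in pronunciations[word.lower()][0]]

print(phonemes('abandoned'))   # ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D']
```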
I want to split each original word's letters into a number of groups equal to its number of phonemes, e.g.
ABANDONED => [ 'A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED' ]
I don't have a mapping of phonemes to graphemes, but it seems like I should be able to build a statistical model of phoneme-to-grapheme correspondences and then use it to decide where to split each word. (It would be nice if the model could also be used to convert new words to their probable phonemes.)
How can I do this? A hidden Markov model sounds like it could be applicable, but beyond that hunch I don't know.
Upvotes: 4
Views: 1307
Reputation: 9451
To gather statistics, first align each word to its phonetic representation by matching the identical letters and phonemes (like N and N). You can get the best match with dynamic programming. Then you can map the remaining characters of the word to the remaining phonemes.
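For illustration, here is a rough sketch of that idea in Python (my own interpretation, not a drop-in solution): find anchor points where a letter and a phoneme symbol are identical using an LCS-style dynamic program, then spread the letters between consecutive anchors over the phonemes between the same anchors.

```python
def align(letters, phones):
    n, m = len(letters), len(phones)
    # dp[i][j] = max number of exact letter/phoneme matches using letters[:i] and phones[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = 1 if letters[i - 1].upper() == phones[j - 1] else 0
            dp[i][j] = max(dp[i - 1][j - 1] + hit, dp[i - 1][j], dp[i][j - 1])
    # Trace back to recover the anchor pairs (letter index, phoneme index)
    anchors, i, j = [], n, m
    while i > 0 and j > 0:
        if letters[i - 1].upper() == phones[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            anchors.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    anchors.reverse()
    # Distribute the letters between consecutive anchors over the phonemes
    # between the same anchors; (n, m) acts as a sentinel "anchor" at the end.
    groups = [''] * m
    prev_i, prev_j = -1, -1
    for ai, aj in anchors + [(n, m)]:
        span_letters = letters[prev_i + 1:ai]
        span_phones = list(range(prev_j + 1, aj))
        if span_phones:
            # naive even split of the in-between letters over the in-between phonemes
            for k, pj in enumerate(span_phones):
                lo = k * len(span_letters) // len(span_phones)
                hi = (k + 1) * len(span_letters) // len(span_phones)
                groups[pj] = span_letters[lo:hi]
            span_letters = ''
        if ai < n:
            # the anchor letter itself, plus any leftover letters with no phoneme of their own
            groups[aj] = span_letters + letters[ai]
        elif span_letters:
            # trailing letters with no phoneme left get appended to the last group
            groups[-1] += span_letters
        prev_i, prev_j = ai, aj
    return groups

print(align('ABANDONED', ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D']))
# -> ['A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED']
```

The even split between anchors is the crudest possible choice; the frequencies gathered from a first pass could be used to re-align those in-between spans.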
Once you calculate the frequencies, you can use the noisy channel model to convert new words to phonemes.
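A sketch of that step too (again my own function names and smoothing choices): count how often each phoneme is aligned with each letter group, which gives P(spelling | phonemes); a full noisy-channel decoder would multiply this by P(phonemes) from a phoneme n-gram model and search over candidate phoneme sequences for a new spelling, e.g. with a Viterbi-style dynamic program over segmentations of the spelling.

```python
from collections import Counter, defaultdict
from math import log

emission = defaultdict(Counter)   # phoneme -> counts of the letter groups it aligned with

def count_pair(groups, phones):
    # groups come from an alignment like the one sketched above
    for p, g in zip(phones, groups):
        emission[p][g] += 1

def log_channel_prob(groups, phones, alpha=1.0):
    # log P(spelling | phonemes), with simple add-alpha smoothing for unseen pairs
    total = 0.0
    for p, g in zip(phones, groups):
        counts = emission[p]
        total += log((counts[g] + alpha) /
                     (sum(counts.values()) + alpha * (len(counts) + 1)))
    return total

count_pair(['A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED'],
           ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D'])
```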
Upvotes: 1