Reputation: 2858
I have a mapping of 100,000+ words to their phonemes (CMUdict), like:
ABANDONED => [ 'AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D' ]
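For reference, the raw CMUdict entries carry stress digits on the vowels (like 'AH0'); stripping them gives lists like the one above. With NLTK's copy of the dictionary, for example:

```python
# Example only: NLTK ships a copy of CMUdict (may need nltk.download('cmudict') first).
from nltk.corpus import cmudict

pronunciations = cmudict.dict()   # lowercase word -> list of pronunciations

def phonemes(word):
    # take the first pronunciation and drop the stress digits: 'AH0' -> 'AH'
    return [p.rstrip('0123456789') for p in pronunciations[word.lower()][0]]

print(phonemes('abandoned'))   # ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D']
```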
I want to split each original word's letters into a number of groups equal to its number of phonemes, e.g.
ABANDONED => [ 'A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED' ]
I don't have a mapping of phonemes to graphemes, but it seems like I should be able to build a statistical model of phoneme-to-grapheme correspondences and then use it to decide where to split each word. (It would be nice if the model could also be used to convert new words to their probable phonemes.)
How can I do this? A hidden Markov model sounds like it could be applicable, but beyond that hunch I don't know.
Upvotes: 4
Views: 1307
Reputation: 9451
To gather statistics, first align each word to its phonetic representation by matching the identical letters and phonemes (like N and N). You can get the best match with dynamic programming. Then you can map the remaining characters of the word to the remaining phonemes.
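For illustration, here is a rough sketch of that idea in Python (my own interpretation, not a drop-in solution): find anchor points where a letter and a phoneme symbol are identical using an LCS-style dynamic program, then spread the letters between consecutive anchors over the phonemes between the same anchors.

```python
def align(letters, phones):
    n, m = len(letters), len(phones)
    # dp[i][j] = max number of exact letter/phoneme matches using letters[:i] and phones[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = 1 if letters[i - 1].upper() == phones[j - 1] else 0
            dp[i][j] = max(dp[i - 1][j - 1] + hit, dp[i - 1][j], dp[i][j - 1])
    # Trace back to recover the anchor pairs (letter index, phoneme index)
    anchors, i, j = [], n, m
    while i > 0 and j > 0:
        if letters[i - 1].upper() == phones[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            anchors.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    anchors.reverse()
    # Distribute the letters between consecutive anchors over the phonemes
    # between the same anchors; (n, m) acts as a sentinel "anchor" at the end.
    groups = [''] * m
    prev_i, prev_j = -1, -1
    for ai, aj in anchors + [(n, m)]:
        span_letters = letters[prev_i + 1:ai]
        span_phones = list(range(prev_j + 1, aj))
        if span_phones:
            # naive even split of the in-between letters over the in-between phonemes
            for k, pj in enumerate(span_phones):
                lo = k * len(span_letters) // len(span_phones)
                hi = (k + 1) * len(span_letters) // len(span_phones)
                groups[pj] = span_letters[lo:hi]
            span_letters = ''
        if ai < n:
            # the anchor letter itself, plus any leftover letters with no phoneme of their own
            groups[aj] = span_letters + letters[ai]
        elif span_letters:
            # trailing letters with no phoneme left get appended to the last group
            groups[-1] += span_letters
        prev_i, prev_j = ai, aj
    return groups

print(align('ABANDONED', ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D']))
# -> ['A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED']
```

The even split between anchors is the crudest possible choice; the frequencies gathered from a first pass could be used to re-align those in-between spans.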
Once you calculate the frequencies, you can use the noisy channel model to convert new words to phonemes.
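A sketch of that step too (again my own function names and smoothing choices): count how often each phoneme is aligned with each letter group, which gives P(spelling | phonemes); a full noisy-channel decoder would multiply this by P(phonemes) from a phoneme n-gram model and search over candidate phoneme sequences for a new spelling, e.g. with a Viterbi-style dynamic program over segmentations of the spelling.

```python
from collections import Counter, defaultdict
from math import log

emission = defaultdict(Counter)   # phoneme -> counts of the letter groups it aligned with

def count_pair(groups, phones):
    # groups come from an alignment like the one sketched above
    for p, g in zip(phones, groups):
        emission[p][g] += 1

def log_channel_prob(groups, phones, alpha=1.0):
    # log P(spelling | phonemes), with simple add-alpha smoothing for unseen pairs
    total = 0.0
    for p, g in zip(phones, groups):
        counts = emission[p]
        total += log((counts[g] + alpha) /
                     (sum(counts.values()) + alpha * (len(counts) + 1)))
    return total

count_pair(['A', 'B', 'A', 'N', 'D', 'O', 'N', 'ED'],
           ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D'])
```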
Upvotes: 1