Reputation: 1513
I'm trying to figure out what exactly context-independent/dependent acoustic modeling is. I've been trying to read through some of the papers that address it, but I'm still a little shaky on the concept. As I currently understand it (which could be wrong), context-dependent acoustic models are acoustic models trained on data where the phonemes occur in sequences. For example, trained on a target language with words, so each phoneme is given context by the phonemes that occur before and after it. A context-independent model would then be an acoustic model somehow trained on just the phonemes in isolation.
Upvotes: 2
Views: 1718
Reputation: 25220
The conventional approach is to recognize speech with a hidden Markov model (HMM). Basically, in an HMM you try to represent the input sound as a sequence of states. Each state corresponds to a certain part of a phoneme.
The difference is not in what data the model is trained on, but in the structure of the model itself. An acoustic model is a set of sound detectors. Each detector describes what a sound is like; for example, it might be a Gaussian Mixture Model (GMM) describing the most probable values of the phoneme's features, or it could be a neural network that detects a specific sound.
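To make the "detector" idea concrete, here is a toy GMM scoring one feature frame. All the numbers are made up for illustration; a real model has many mixture components and 39-dimensional (or larger) feature vectors:

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_log_likelihood(frame, components):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    components: list of (weight, means, vars) tuples; means and vars are
    per-dimension lists (diagonal covariance).
    """
    log_terms = []
    for weight, means, variances in components:
        ll = math.log(weight)
        for x, m, v in zip(frame, means, variances):
            ll += log_gaussian(x, m, v)
        log_terms.append(ll)
    # log-sum-exp over mixture components for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

# Toy two-component detector for one HMM state (invented parameters)
hh_begin = [
    (0.6, [0.0, 1.0], [1.0, 1.0]),
    (0.4, [2.0, -1.0], [0.5, 2.0]),
]

near = gmm_log_likelihood([0.1, 0.9], hh_begin)    # frame close to a mean
far = gmm_log_likelihood([10.0, 10.0], hh_begin)   # frame far from both means
```

During decoding, each frame of audio is scored against every active detector like this, and the HMM search finds the state sequence with the best total score.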
In a context-independent model the structure of the hidden Markov model is simple: you detect all occurrences of a phone with a single set of detectors. Say you detect the word "hi" with the detectors
HH_begin HH_middle HH_end IY_begin IY_middle IY_end
And you detect the word "hoy" with exactly the same detectors for the phone HH:
HH_begin HH_middle HH_end OY_begin OY_middle OY_end
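This context-independent expansion is just a lookup; a sketch with a toy two-word pronunciation dictionary (a real lexicon has thousands of entries):

```python
# Toy pronunciation dictionary: word -> phone sequence
LEXICON = {"hi": ["HH", "IY"], "hoy": ["HH", "OY"]}

def monophone_states(word):
    """Expand a word into context-independent HMM state names."""
    states = []
    for phone in LEXICON[word]:
        for part in ("begin", "middle", "end"):
            states.append(f"{phone}_{part}")
    return states

print(monophone_states("hi"))
# ['HH_begin', 'HH_middle', 'HH_end', 'IY_begin', 'IY_middle', 'IY_end']
```

Note that both words map their first three states to the very same HH detectors, so every occurrence of HH in the training data trains the same parameters.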
In a context-dependent model the detectors for HH in "hi" and "hoy" are different and trained separately, so the model as a whole has many more parameters. This is reasonable because the surrounding phones do affect the pronunciation of a phone; it starts to sound a bit different. So you have
HH_before_IY_begin HH_before_IY_middle
HH_before_IY_end IY_after_HH_begin
IY_after_HH_middle IY_after_HH_end
And for "hoy":
HH_before_OY_begin HH_before_OY_middle
HH_before_OY_end OY_after_HH_begin
OY_after_HH_middle OY_after_HH_end
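The context-dependent naming above can be sketched the same way (toy lexicon again; real systems use full triphones with both left and right context, while the two-phone words here only ever have one neighbor):

```python
# Toy pronunciation dictionary: word -> phone sequence
LEXICON = {"hi": ["HH", "IY"], "hoy": ["HH", "OY"]}

def context_dependent_states(word):
    """Expand a word into context-dependent HMM state names.

    Mirrors the naming in the text: a phone is annotated with its
    following neighbor when it has one, otherwise with its preceding one.
    """
    phones = LEXICON[word]
    states = []
    for i, phone in enumerate(phones):
        if i + 1 < len(phones):
            name = f"{phone}_before_{phones[i + 1]}"
        else:
            name = f"{phone}_after_{phones[i - 1]}"
        for part in ("begin", "middle", "end"):
            states.append(f"{name}_{part}")
    return states

print(context_dependent_states("hi"))
# ['HH_before_IY_begin', 'HH_before_IY_middle', 'HH_before_IY_end',
#  'IY_after_HH_begin', 'IY_after_HH_middle', 'IY_after_HH_end']
```

Now the HH states of "hi" and "hoy" are distinct names, so they get distinct, separately trained detectors.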
The advantage of this approach is that because you have more parameters you can recognize speech more accurately. The disadvantage is that you have to consider many more variants.
Speech recognition algorithms are quite complex, well beyond what the public web usually describes. For example, to reduce the number of detectors, context-dependent states are usually clustered and tied into a smaller set. Instead of hundreds of thousands of possible context-dependent detectors you have just a couple of thousand tied detectors, merged to provide good discrimination and generalization.
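State tying can be pictured as a lookup from a context-dependent state name to a shared detector (often called a senone). The table below is invented purely for illustration; real systems derive the groupings automatically with phonetic decision trees:

```python
# Invented tying table: many context-dependent states map to few detectors.
TIED = {
    "HH_before_IY_begin": "senone_17",
    "HH_before_OY_begin": "senone_17",   # similar contexts share one detector
    "IY_after_HH_begin":  "senone_42",
    "OY_after_HH_begin":  "senone_58",
}

def detector_for(state):
    """Map a context-dependent state to its tied detector (senone)."""
    return TIED[state]

print(detector_for("HH_before_IY_begin"))
# senone_17
```

Here HH before IY and HH before OY end up sharing one detector because the contexts are acoustically similar, so both get trained on the pooled data, which is exactly what gives the tied system its generalization.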
If you are serious about speech recognition algorithms and practice, instead of random sources on the web it is better to read a textbook like Spoken Language Processing, or at least the paper "The Application of Hidden Markov Models in Speech Recognition".
Upvotes: 4