lmqsantana

Reputation: 11

Speech recognition with an LSTM using MFCC features

While studying deep neural networks, specifically LSTMs, I decided to follow the idea proposed in this link: Building Speech Dataset for LSTM binary classification, to build a classifier.

I have an audio dataset from which I extract MFCC features; each phoneme of a word yields a 13x56 array. The training data would look like this:

X = [[[phon1fram[1][1], phon1fram[1][2], ..., phon1fram[1][56]],
      [phon1fram[2][1], phon1fram[2][2], ..., phon1fram[2][56]],
      ...
      [phon1fram[15][1], phon1fram[15][2], ..., phon1fram[15][56]]],
     ...
     [[phon5fram[1][1], phon5fram[1][2], ..., phon5fram[1][56]],
      ...
      [phon5fram[15][1], phon5fram[15][2], ..., phon5fram[15][56]]]]

For the labels, is it correct that the first frames would be labeled as "intermediate" and only the last frame would actually represent the phoneme?

Y = [[[0, 0, ..., 0],        # intermediate
      [0, 0, ..., 0], ...,   # intermediate
      [1, 0, ..., 0]],       # is one phoneme
     [[0, 0, ..., 0], ...,   # intermediate
      [0, 1, ..., 0]]]       # another phoneme

Is this really correct? In my first tests, all predicted outputs tended toward the "intermediate" label, since it is by far the most prevalent class. Is there another approach I could use?
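One common remedy for this kind of class imbalance (not from the original post, just a suggestion) is to weight the loss by inverse class frequency, so the dominant "intermediate" label no longer swamps training. A minimal NumPy sketch, assuming the per-frame one-hot labels described above, with illustrative shapes (15 frames, 6 classes):

```python
import numpy as np

# Hypothetical label matrix for one phoneme: 15 frames, 6 one-hot classes.
# Frames 0..13 carry the "intermediate" class (index 0); only the last
# frame carries the actual phoneme class, as described in the question.
n_frames, n_classes = 15, 6
Y = np.zeros((n_frames, n_classes))
Y[:-1, 0] = 1   # intermediate frames
Y[-1, 1] = 1    # final frame: the phoneme itself

# Inverse-frequency class weights: rare classes get larger weights.
counts = Y.sum(axis=0)
weights = np.where(counts > 0,
                   counts.sum() / (n_classes * np.maximum(counts, 1)),
                   0.0)
print(weights)  # the phoneme class gets a much larger weight than "intermediate"
```

These weights could then be passed to the loss function (e.g. via `class_weight` in Keras). An alternative is a many-to-one setup, where the LSTM emits a single label per phoneme sequence instead of one label per frame, so the "intermediate" class disappears entirely.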

Upvotes: 1

Views: 2315

Answers (1)

Kunal saxena

Reputation: 29

I am doing the same task. I am using Keras recurrent layers (http://keras.io/layers/recurrent/) with the Theano backend to accomplish it. You can follow these steps:

  1. Store the MFCC values in a text file.
  2. Read the text file and store all the values in a NumPy array.
  3. Pass this NumPy array to the input of your neural net.
  4. Apply padding before feeding the input.

You can play around with the hyperparameters (batch_size, optimizer, loss function, sequence size) to evaluate the results.
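The store/read/pad steps above can be sketched in NumPy as follows (a minimal sketch; the file name, the 13x56 MFCC shape, and the target length of 20 frames are illustrative):

```python
import numpy as np

# Step 1: store MFCC values in a text file (one frame per row).
mfcc = np.random.rand(13, 56)   # illustrative 13x56 MFCC array
np.savetxt("mfcc.txt", mfcc)

# Step 2: read the text file back into a NumPy array.
loaded = np.loadtxt("mfcc.txt")

# Step 4: zero-pad the sequence to a fixed number of frames so every
# example fed to the network has the same length.
max_frames = 20
padded = np.pad(loaded, ((0, max_frames - loaded.shape[0]), (0, 0)))

# Step 3: `padded[np.newaxis]` (shape 1 x max_frames x 56) can then be
# fed as input to a recurrent layer such as a Keras LSTM.
print(padded.shape)
```

Keras also provides `keras.preprocessing.sequence.pad_sequences` to do the padding over a whole batch of variable-length sequences at once.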

Upvotes: 1
