lmqsantana

Reputation: 11

Speech recognition with an LSTM using MFCC features

While studying deep neural networks, specifically LSTMs, I decided to follow the idea proposed in this link: Building Speech Dataset for LSTM binary classification, to build a classifier.

I have an audio dataset from which I extract MFCC features; each phoneme of a word yields a 13x56 array. The training data would look like this:

X = [[[phon1fram[1][1], phon1fram[1][2], ..., phon1fram[1][56]],
      [phon1fram[2][1], phon1fram[2][2], ..., phon1fram[2][56]],
      ...
      [phon1fram[15][1], phon1fram[15][2], ..., phon1fram[15][56]]],
     ...
     [[phon5fram[1][1], phon5fram[1][2], ..., phon5fram[1][56]],
      ...
      [phon5fram[15][1], phon5fram[15][2], ..., phon5fram[15][56]]]]

For the labels, is it correct that the first frames would be labeled as "intermediate" and only the last frame would actually represent the phoneme?

Y = [[[0, 0, ..., 0],        # intermediate
      [0, 0, ..., 0], ...,   # intermediate
      [1, 0, ..., 0]],       # is one phoneme
     [[0, 0, ..., 0], ...,   # intermediate
      [0, 1, ..., 0]]]       # another phoneme

Is this really correct? In my first tests, all predicted outputs tended toward the "intermediate" label, since it is by far the most prevalent class. Is there another approach I could use?
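One common remedy for this kind of class imbalance (not from the original post, just a suggestion) is to weight the loss by inverse class frequency, so the dominant "intermediate" label no longer swamps training. A minimal NumPy sketch, assuming the per-frame one-hot labels described above, with illustrative shapes (15 frames, 6 classes):

```python
import numpy as np

# Hypothetical label matrix for one phoneme: 15 frames, 6 one-hot classes.
# Frames 0..13 carry the "intermediate" class (index 0); only the last
# frame carries the actual phoneme class, as described in the question.
n_frames, n_classes = 15, 6
Y = np.zeros((n_frames, n_classes))
Y[:-1, 0] = 1   # intermediate frames
Y[-1, 1] = 1    # final frame: the phoneme itself

# Inverse-frequency class weights: rare classes get larger weights.
counts = Y.sum(axis=0)
weights = np.where(counts > 0,
                   counts.sum() / (n_classes * np.maximum(counts, 1)),
                   0.0)
print(weights)  # the phoneme class gets a much larger weight than "intermediate"
```

These weights could then be passed to the loss function (e.g. via `class_weight` in Keras). An alternative is a many-to-one setup, where the LSTM emits a single label per phoneme sequence instead of one label per frame, so the "intermediate" class disappears entirely.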

Upvotes: 1

Views: 2315

Answers (1)

Kunal saxena

Reputation: 29

I am doing the same task. I am using Keras recurrent layers (http://keras.io/layers/recurrent/) with the Theano backend to accomplish it. You can follow these steps:

  1. Store the MFCC values in a text file.
  2. Read the text file and store all the values in a NumPy array.
  3. Pass this NumPy array to the input of your neural net.
  4. Apply padding before feeding the input.

You can play around with the hyperparameters (batch_size, optimizer, loss function, sequence size) to evaluate the results.
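The store/read/pad steps above can be sketched in NumPy as follows (a minimal sketch; the file name, the 13x56 MFCC shape, and the target length of 20 frames are illustrative):

```python
import numpy as np

# Step 1: store MFCC values in a text file (one frame per row).
mfcc = np.random.rand(13, 56)   # illustrative 13x56 MFCC array
np.savetxt("mfcc.txt", mfcc)

# Step 2: read the text file back into a NumPy array.
loaded = np.loadtxt("mfcc.txt")

# Step 4: zero-pad the sequence to a fixed number of frames so every
# example fed to the network has the same length.
max_frames = 20
padded = np.pad(loaded, ((0, max_frames - loaded.shape[0]), (0, 0)))

# Step 3: `padded[np.newaxis]` (shape 1 x max_frames x 56) can then be
# fed as input to a recurrent layer such as a Keras LSTM.
print(padded.shape)
```

Keras also provides `keras.preprocessing.sequence.pad_sequences` to do the padding over a whole batch of variable-length sequences at once.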

Upvotes: 1
