Reputation: 11
While studying deep neural networks, specifically LSTMs, I decided to follow the idea proposed in this link: Building Speech Dataset for LSTM binary classification, to build a classifier.
I have an audio dataset from which I extract MFCC features, giving a 13x56 array for each phoneme of a word. The training data would look like this:
X = [[phon1fram[1][1],  phon1fram[1][2],  ..., phon1fram[1][56]],
     [phon1fram[2][1],  phon1fram[2][2],  ..., phon1fram[2][56]],
     ...
     [phon1fram[15][1], phon1fram[15][2], ..., phon1fram[15][56]]]
    ...
    [[phon5fram[1][1],  phon5fram[1][2],  ..., phon5fram[1][56]],
     ...
     [phon5fram[15][1], phon5fram[15][2], ..., phon5fram[15][56]]]
When labeling, is it correct that the first frames would be labeled as "intermediate" and only the last frame actually represents the phoneme?
Y = [[0, 0, ..., 0],   # intermediate
     [0, 0, ..., 0],   # intermediate
     ...
     [1, 0, ..., 0]]   # one phoneme
    [[0, 0, ..., 0],   # intermediate
     ...
     [0, 1, ..., 0]]   # another phoneme
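In code, the arrays I am building look roughly like this (the phoneme count, frame count, feature size, and number of classes below are only illustrative):

import numpy as np

n_phonemes = 5    # number of phoneme examples (illustrative)
n_frames   = 15   # frames per phoneme (illustrative)
n_features = 56   # features per frame (illustrative)
n_classes  = 10   # number of phoneme labels (illustrative)

# X: one sequence of frames per phoneme example
X = np.zeros((n_phonemes, n_frames, n_features))

# Y: one label vector per frame; frames before the last stay all-zero
# ("intermediate"), only the last frame carries the phoneme's one-hot label
Y = np.zeros((n_phonemes, n_frames, n_classes))
Y[0, -1, 0] = 1   # last frame of example 0 -> one phoneme
Y[1, -1, 1] = 1   # last frame of example 1 -> another phoneme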
Is this really correct? During my first tests, all of the predicted outputs tended toward this "intermediate" label, since it is the most prevalent one. Is there any other approach I could use?
Upvotes: 1
Views: 2315
Reputation: 29
I am doing the same task. I am using the recurrent layers from Keras (http://keras.io/layers/recurrent/) with the Theano backend. You can follow these steps:
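As a minimal sketch (not an exact setup; the layer size of 64 and the class count are arbitrary, and it assumes per-frame one-hot labels as in the question), a model built from those recurrent layers could look like:

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

n_frames   = 15   # frames per sequence (from the question)
n_features = 56   # features per frame (from the question)
n_classes  = 10   # number of phoneme labels (example value)

model = Sequential()
# return_sequences=True makes the LSTM emit one output per frame,
# so the output shape matches the per-frame labels in Y
model.add(LSTM(64, return_sequences=True, input_shape=(n_frames, n_features)))
# the same softmax classifier is applied to every frame
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])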
You can play around with the hyperparameters (batch_size, optimizer, loss function, sequence size) to evaluate the results.
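For example (argument names can differ slightly between Keras versions, e.g. nb_epoch instead of epochs in older 1.x releases; the values below are just a starting point):

# X: (num_sequences, n_frames, n_features), Y: (num_sequences, n_frames, n_classes)
model.fit(X, Y, batch_size=32, epochs=20, validation_split=0.1)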
Upvotes: 1