Reputation: 69
I am trying to make a model that is able to extract human speech from a recording. To do this I have loaded 1500 noisy files (some of these files are the exact same but with different speech to noise ratios (-1,1,3,5,7). I want my model to take in a wav file as a one dimensional array/tensor along the horizontal axis, and output a one dimensional array/tensor that I could then play.
currently this is how my data is set up.
an error I am having is that I am not able to make a prediction and when I am i get an array/tensor with only one element, instead one with 220500. The reason behind 22050 is that it is the length of the background noise that was overlapped into clean speech so every file is this length.
I have been messing around with layers.Input because while I want my model to take in every row as one "object"/audio clip. I dont know if that is what's happening because the only "successful" prediction is an error
Upvotes: 0
Views: 447
Reputation: 309
The model you built expect data in the format (batch_size, 1, 220500), as in the input layer you declared an input_shape of (1, 220500).
For the data you are using you should just use an input_shape of (220500,).
Another problem you might encounter, is that you are using a single unit in the last layer. This way the output of the model will be (batch_size, 1), but you need (batch_size, 220500) as an output.
For this last problem I suggest you to use a generative recurrent neural network.
Upvotes: 1