Reputation: 11
Due to a lack of understanding of how audio works, I have a question: what is meant in the text below? Is it required that the length of each audio file be divisible by 5 seconds without a remainder, or something else?
> Step 2: Transfer Spectrogram to Array: each row in the array represents the frequency level and each column is a time frame. The value in the array represents the amplitude. To avoid two songs being concatenated into one segment, silence (values of 0) is padded at the end of the song based on the size of the audio.

> Step b) The input is sliced into small time windows (segments) based on a defined segment length (5 sec is used in this project). This means that the spectrogram frames (Xt) are sliced into 5-second segments and are sequentially used as input features to feed into the hidden layer together with the previous time step's outputs (ht-1).
Here is the link to the full article. https://towardsdatascience.com/practical-introduction-to-automation-music-transcription-3ad8ad40eab6
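If I understand it correctly, the 5 seconds are first converted into a number of spectrogram columns (time frames). Below is a minimal sketch of how I compute that number; the sample rate, hop length, and file name are my own assumptions, not values from the article:

```python
# A minimal sketch, assuming a librosa magnitude spectrogram; sr, hop_length
# and "song.wav" are my assumptions, not values from the article.
import librosa
import numpy as np

sr = 22050            # sample rate (assumed)
hop_length = 512      # STFT hop size (assumed)
segment_seconds = 5   # segment length used in the article

y, _ = librosa.load("song.wav", sr=sr)
spectrogram_array = np.abs(librosa.stft(y, hop_length=hop_length))

# number of spectrogram columns (time frames) that cover one 5-second segment
num_cols = int(segment_seconds * sr / hop_length)
print(spectrogram_array.shape, num_cols)
```

So, as I read it, `num_cols` is the number of columns that cover 5 seconds, not the number of seconds itself.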
Here is my example of the current code, as I understand it:
```python
import numpy as np

# Pad with zero-valued (silence) columns so that the number of time frames
# becomes an exact multiple of the segment length num_cols.
num_cols_old = spectrogram_array.shape[1]
if num_cols_old % num_cols != 0:
    num_cols_new = (num_cols_old // num_cols + 1) * num_cols
    spectrogram_array = np.pad(spectrogram_array, ((0, 0), (0, num_cols_new - num_cols_old)), mode='constant')
```
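After this padding the number of columns is an exact multiple of `num_cols`, so (as I understand step b) the spectrogram can be split into equal 5-second segments, and the song length itself does not have to be divisible by 5. A minimal sketch of that slicing step, continuing from the code above (not the article's exact code):

```python
# split the padded spectrogram into equal 5-second segments along the time axis
num_segments = spectrogram_array.shape[1] // num_cols
segments = np.split(spectrogram_array, num_segments, axis=1)
# each element of `segments` has shape (num_frequency_bins, num_cols)
```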
Upvotes: 1
Views: 34