Joe


Mel Spectrogram feature extraction to CNN

This question follows on from the question posted here, with a slight difference in the CNN part. Using this feature-extraction function:

import librosa
import numpy as np

max_pad_len = 174
n_mels = 128

def extract_features(file_name):
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # Shape: (n_mels, n_frames); n_frames depends on the clip's length
        mely = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=n_mels)
        #pad_width = max_pad_len - mely.shape[1]
        #mely = np.pad(mely, pad_width=((0, 0), (0, pad_width)), mode='constant')

    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    return mely
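
Without the padding lines, each spectrogram's width depends on the length of its clip, so the arrays cannot be stacked into a single training array. Below is a minimal NumPy sketch of making them uniform, assuming you want exactly `max_pad_len` frames. The helper `fix_width` is hypothetical, random arrays stand in for real `extract_features` output, and truncating longer clips is an added assumption. The commented-out `np.pad` call alone fails on clips longer than `max_pad_len`, because the pad width becomes negative.

```python
import numpy as np

n_mels = 128
max_pad_len = 174


def fix_width(mel, width=max_pad_len):
    """Zero-pad or truncate a (n_mels, frames) spectrogram to exactly `width` frames."""
    frames = mel.shape[1]
    if frames >= width:
        # Longer clips: keep only the first `width` frames
        return mel[:, :width]
    # Shorter clips: append zero columns at the end of the time axis
    return np.pad(mel, pad_width=((0, 0), (0, width - frames)), mode="constant")


# Hypothetical stand-in spectrograms of different lengths (random data)
rng = np.random.default_rng(0)
mels = [rng.random((n_mels, frames)) for frames in (150, 174, 200)]

fixed = np.stack([fix_width(m) for m in mels])
print(fixed.shape)  # (3, 128, 174)
```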

How do I get the correct values of num_rows, num_columns and num_channels for the train and test data?

When constructing the CNN model, how do I determine the correct input shape?

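from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout

model = Sequential()
# num_rows, num_columns and num_channels are the unknowns this question asks about
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))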
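
Assuming every spectrogram has already been padded to `max_pad_len` frames, the three values can be read off the feature array: rows are the mel bands, columns are the time frames, and there is a single channel. The sketch below uses a zero-filled stand-in array; the sample count of 10 is arbitrary.

```python
import numpy as np

n_mels = 128        # rows: mel bands, from melspectrogram(n_mels=...)
max_pad_len = 174   # columns: time frames after padding

# Stand-in for the padded features of 10 clips: (samples, n_mels, frames)
features = np.zeros((10, n_mels, max_pad_len), dtype=np.float32)

num_rows, num_columns = features.shape[1], features.shape[2]
num_channels = 1    # one "grayscale" channel per spectrogram

# Add the trailing channel axis Conv2D expects (channels_last is the Keras default)
features = features.reshape(features.shape[0], num_rows, num_columns, num_channels)
input_shape = (num_rows, num_columns, num_channels)
print(input_shape)     # (128, 174, 1)
print(features.shape)  # (10, 128, 174, 1)
```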


Answers (1)

Enzo Conejero

I don't know if this is exactly your problem, but I also had to use a mel spectrogram as input to a CNN.

Short answer:

input_shape = (x_train.shape[1], x_train.shape[2], 1)
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)

or

x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
input_shape = x_train.shape[1:]
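
The explicit `reshape` is not the only way to add the channel axis. This sketch, which uses a small stand-in array in place of real data, shows that `np.expand_dims` and `[..., np.newaxis]` produce the same result. Either form avoids spelling out every dimension.

```python
import numpy as np

# Stand-in batch shaped like the answer's data: (samples, n_mels, frames)
x_train = np.arange(4 * 128 * 24, dtype=np.float32).reshape(4, 128, 24)

a = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
b = np.expand_dims(x_train, axis=-1)   # append a length-1 channel axis
c = x_train[..., np.newaxis]           # same thing via indexing

print(a.shape, b.shape, c.shape)  # (4, 128, 24, 1) (4, 128, 24, 1) (4, 128, 24, 1)
print(b.shape[1:])                # (128, 24, 1), the Conv2D input_shape
```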

Long answer:

In my case, I have a DataFrame with speaker IDs (speaker_id) and mel spectrograms (mel) that I computed earlier with librosa.

Keras CNN models expect image-like input with height, width and a color-channel dimension (1 for grayscale, 3 for RGB).

The mel spectrograms returned by librosa are 2-D arrays of shape (n_mels, frames), i.e. they have a height and a width but no channel axis, so you need to reshape them to add a channel dimension of size 1.
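
The width (number of frames) of a librosa mel spectrogram depends on the clip length. The sketch below shows that relationship, assuming librosa's defaults of `center=True` and `hop_length=512`, and that `librosa.load` resampled the audio to its default 22050 Hz. The helper `n_frames` is hypothetical, not a librosa function.

```python
sr = 22050          # librosa.load default sample rate
hop_length = 512    # librosa default hop length


def n_frames(n_samples, hop_length=hop_length):
    """Frame count of an STFT-based feature when center=True (librosa's default)."""
    return 1 + n_samples // hop_length


print(n_frames(sr))      # 1 second  -> 44
print(n_frames(4 * sr))  # 4 seconds -> 173
```

So clips of different durations give spectrograms of different widths. They must be padded or cropped to a common size before they can be stacked into one array.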

  1. Define the input and expected output
import numpy as np

# Looks odd, but it converts the pandas Series of 2-D arrays into a single 3-D np.array
x = np.array(list(df.mel))
y = df.speaker_id
print('X shape:', x.shape)

X shape: (2204, 128, 24)
That is, 2204 mel spectrograms, each 128x24 (mel bands x time frames).
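
The `np.array(list(df.mel))` trick only yields a 3-D array when every spectrogram has the same shape. Here is a sketch of this step that uses plain lists in place of the DataFrame columns (the toy `mel_list` and `speaker_ids` values are hypothetical). `np.stack` makes the equal-shape requirement explicit by raising `ValueError` on mismatched shapes. `np.unique(..., return_inverse=True)` turns arbitrary speaker IDs into integer labels 0..n_classes-1, which is the label format `SparseCategoricalCrossentropy` expects.

```python
import numpy as np

# Hypothetical stand-ins for df.mel and df.speaker_id
mel_list = [np.ones((128, 24)) * i for i in range(5)]
speaker_ids = ["spk_b", "spk_a", "spk_b", "spk_c", "spk_a"]

# Stack equally shaped 2-D spectrograms into (samples, n_mels, frames)
x = np.stack(mel_list)
print('X shape:', x.shape)  # X shape: (5, 128, 24)

# Map each distinct speaker ID to an integer class index (classes are sorted)
classes, y = np.unique(speaker_ids, return_inverse=True)
print(classes)  # ['spk_a' 'spk_b' 'spk_c']
print(y)        # [1 0 1 2 0]

# Spectrograms of different widths cannot be stacked
try:
    np.stack([np.ones((128, 24)), np.ones((128, 30))])
except ValueError as err:
    print('Cannot stack:', err)
```

The number of classes, `len(classes)`, is also the number of units the final `Dense` layer needs.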

  2. Split into train and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)
print(f'Train: {len(x_train)}', f'Test: {len(x_test)}')

Train: 1653 Test: 551

  3. Reshape to add the channel dimension
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], x_test.shape[2], 1)
print('Shapes:', x_train.shape, x_test.shape)

Shapes: (1653, 128, 24, 1) (551, 128, 24, 1)

  4. Set input_shape
# The input shape excludes the first (sample/batch) dimension
input_shape = x_train.shape[1:]
print('Input shape:', input_shape)

Input shape: (128, 24, 1)

  5. Put it into the model
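import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D())
# More layers...
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])
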
  6. Run the model
model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))
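
To see how the input shape flows through the layers: a `Conv2D` with the default `padding='valid'` and stride 1 outputs `in - kernel + 1` along each spatial axis. `MaxPooling2D` with the default `padding='valid'` and strides equal to the pool size outputs `(in - pool) // pool + 1`. The sketch below does that arithmetic for both the answer's model and the question's first layers, assuming the question's spectrograms are padded to 174 frames. The helper names are hypothetical. In Keras, `model.summary()` reports these shapes directly.

```python
def conv2d_valid(shape, filters, kernel):
    """Output shape of Conv2D(filters, kernel) with padding='valid' and strides=1."""
    rows, cols, _ = shape
    return (rows - kernel[0] + 1, cols - kernel[1] + 1, filters)


def maxpool2d_valid(shape, pool=(2, 2)):
    """Output shape of MaxPooling2D(pool) with strides=pool and padding='valid'."""
    rows, cols, channels = shape
    return ((rows - pool[0]) // pool[0] + 1, (cols - pool[1]) // pool[1] + 1, channels)


# The answer's model: input (128, 24, 1), Conv2D(32, (3, 3)), MaxPooling2D()
after_conv = conv2d_valid((128, 24, 1), filters=32, kernel=(3, 3))
after_pool = maxpool2d_valid(after_conv)
print(after_conv, after_pool)  # (126, 22, 32) (63, 11, 32)

# The question's model, assuming 174 padded frames: Conv2D(16, kernel_size=2), MaxPooling2D(pool_size=2)
q_conv = conv2d_valid((128, 174, 1), filters=16, kernel=(2, 2))
q_pool = maxpool2d_valid(q_conv)
print(q_conv, q_pool)  # (127, 173, 16) (63, 86, 16)
```

A `Flatten()` layer after the answer's pooling layer would then produce 63 * 11 * 32 = 22176 features.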

Hope this helps.
