Debayan Ghosh

Reputation: 11

How to apply MFCC Coefficients to DTW

I am trying to implement a speech recognition module using Mel Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW).

I divide the signal x(n) into 25 ms frames with a 10 ms overlap and compute the MFCC parameters for each frame. My main doubt is how to perform DTW in this scenario. Suppose there are M frames and N (13) MFCC coefficients per frame.

So I have an M x N matrix. Now how am I supposed to compute DTW?

Upvotes: 1

Views: 1308

Answers (2)

Radmar

Reputation: 73

DTW compares two audio sequences, so in your case you need two MFCC matrices: x, of size M1 x N, for the sequence to be verified, and y, of size M2 x N, for the query. This implies that your cost matrix will be M1 x M2.

To construct the cost matrix, apply a distance/cost measure between every pair of frames: cost(i,j) = your_chosen_multidimension_metric(x[i,:], y[j,:])

The resulting cost matrix is 2D, so you can apply DTW to it directly.
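As a rough illustration of the cost-matrix step, here is a minimal sketch. The random arrays are synthetic stand-ins for real MFCC matrices, the frame counts (40 and 55) are arbitrary, and plain Euclidean distance is just one possible choice of metric:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Synthetic stand-ins for two MFCC matrices (assumed shapes, not real speech):
# 40 and 55 frames, 13 coefficients per frame.
rng = np.random.default_rng(0)
mfcc_ref = rng.normal(size=(40, 13))    # M1 x N
mfcc_query = rng.normal(size=(55, 13))  # M2 x N

# cost[i, j] is the distance between frame i of the reference
# and frame j of the query, so the result is M1 x M2.
cost = cdist(mfcc_ref, mfcc_query, metric='euclidean')
print(cost.shape)
```

Each row of the result corresponds to one reference frame and each column to one query frame, which is why the two sequences may have different lengths but must share the same number of coefficients.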

I wrote similar code for DTW on MFCCs. Below is a Python implementation that returns the DTW score; x and y are the MFCC matrices of the two voice sequences, with dimensions M1 x N and M2 x N:

import numpy as np
from scipy.spatial.distance import cdist

def my_dtw(x, y):
    # Local cost between every frame of x (M1 x N) and every frame of y (M2 x N)
    cost_matrix = cdist(x, y, metric='seuclidean')
    m, n = cost_matrix.shape
    # Accumulate in place: each cell adds the cheapest of its allowed predecessors
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue  # the start cell keeps its local cost
            elif i == 0:
                # first row: can only be reached from the left
                cost_matrix[i, j] += cost_matrix[i, j-1]
            elif j == 0:
                # first column: can only be reached from above
                cost_matrix[i, j] += cost_matrix[i-1, j]
            else:
                cost_matrix[i, j] += min(cost_matrix[i-1, j],
                                         cost_matrix[i, j-1],
                                         cost_matrix[i-1, j-1])
    # Accumulated cost of the best warping path ending at the last frame pair
    return cost_matrix[m-1, n-1]
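To connect the DTW score back to recognition, one common approach is to compare the query against one stored template per word and pick the lowest score. The following is only a self-contained sketch under assumptions: the names dtw_distance and recognize, the labels "yes" and "no", and the sine/cosine "features" are all hypothetical, and it uses plain Euclidean distance rather than 'seuclidean' to keep the toy example simple:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(x, y):
    """Accumulated DTW cost between two (frames x coefficients) arrays."""
    local = cdist(x, y, metric='euclidean')
    m, n = local.shape
    # Extra border row/column of inf removes the need for edge-case branches.
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[m, n]

def recognize(query, templates):
    """Return the label of the template with the smallest DTW cost."""
    scores = {label: dtw_distance(query, ref) for label, ref in templates.items()}
    return min(scores, key=scores.get), scores

# Toy data (not real MFCCs): 30 frames, 13 identical columns per frame.
t = np.linspace(0.0, 1.0, 30)[:, None]
templates = {
    "yes": np.sin(2 * np.pi * t) * np.ones((1, 13)),
    "no": np.cos(2 * np.pi * t) * np.ones((1, 13)),
}
# The query is the "yes" template spoken twice as slowly (every frame repeated).
query = np.repeat(templates["yes"], 2, axis=0)

label, scores = recognize(query, templates)
print(label, scores)
```

Because the query is only a time-stretched copy of one template, DTW can align it to that template at no cost, while the other template still differs after warping. With real speech the scores are never this clean, and longer utterances accumulate more cost, so some form of length normalization is often applied before comparing scores.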

Upvotes: 0

BIOjack

Reputation: 71

The M x N matrix can be treated as a 1D sequence of M frames, where each element is a vector of N coefficients.

So you have the first pattern:

p1[M*N], len=i, 'silence-HHHEEEEELLLLLOOOOOOOO-silence' sound;

and then the second:

p2[M*N], len=j, like 'HHHHHHEEELLOOOO'

Then run DTW with a Manhattan, Euclidean, Bray-Curtis, etc. distance between frames. The output is a 2D matrix (i x j), and through it there will be a path with minimum weight.
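The minimum-weight path mentioned above can be recovered by backtracking through the accumulated matrix. This is a minimal sketch, assuming Manhattan ('cityblock') distance and the standard three-step pattern (diagonal, up, left); the function name dtw_path and the tiny one-coefficient sequences are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dtw_path(x, y):
    """Return (total cost, list of aligned (i, j) frame pairs)."""
    local = cdist(x, y, metric='cityblock')  # Manhattan distance per frame pair
    m, n = local.shape
    acc = np.full((m + 1, n + 1), np.inf)    # inf border handles the edges
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])

    # Walk back from the last cell, always moving to the cheapest predecessor.
    i, j = m, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[m, n], path

# Toy sequences with one "coefficient" per frame; y repeats the first frame of x.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [2.0]])
total, path = dtw_path(x, y)
print(total, path)
```

The path always starts at the first frame pair and ends at the last one, and each step advances in x, in y, or in both, which is what lets a stretched sound like 'HHHHHH' line up against a shorter 'HHH'.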

Upvotes: 2
