chathux

Reputation: 831

Compare two spoken words with MFCC and DTW using Aquila library

I am trying to find the similarity between spoken words using the Aquila library. My current approach is as follows:
1) First I break the spoken word into smaller frames.
2) Then I apply MFCC to each frame and store the result in a vector.
3) Finally I calculate the distance using DTW.

This is the code I am using:

// Aquila 3 headers (paths may differ slightly between versions)
#include "aquila/source/WaveFile.h"
#include "aquila/source/FramesCollection.h"
#include "aquila/transform/Mfcc.h"
#include "aquila/ml/Dtw.h"
#include <iostream>
#include <vector>

using namespace std;

int frame_size = 1024;

// First recording: split into frames, compute MFCCs per frame
Aquila::WaveFile waveIn0("start_1.wav");
Aquila::FramesCollection frameCollection0(waveIn0, frame_size);
vector<vector<double>> dtwdt0;
Aquila::Mfcc mfcc0(frame_size);
for (int i = 0; i < frameCollection0.count(); i++)
{
    Aquila::Frame frame = frameCollection0.frame(i);
    vector<double> mfccValues = mfcc0.calculate(frame);
    dtwdt0.push_back(mfccValues);
}

// Second recording: same processing
Aquila::WaveFile waveIn1("start_2.wav");
Aquila::FramesCollection frameCollection1(waveIn1, frame_size);
vector<vector<double>> dtwdt1;
Aquila::Mfcc mfcc1(frame_size);
for (int i = 0; i < frameCollection1.count(); i++)
{
    Aquila::Frame frame = frameCollection1.frame(i);
    vector<double> mfccValues = mfcc1.calculate(frame);
    dtwdt1.push_back(mfccValues);
}

// DTW distance between the two MFCC sequences
Aquila::Dtw dtw(Aquila::euclideanDistance, Aquila::Dtw::PassType::Diagonals);
double distance_1 = dtw.getDistance(dtwdt0, dtwdt1);
cout << "Distance : " << distance_1 << endl;

It works, except that it is not accurate enough: sometimes it reports a smaller distance between the spoken words 'start' and 'stop' than between two utterances of 'start'.

Is my code correct? How can I improve the program to get more accurate results? Any help will be appreciated.

Thanks.

Upvotes: 2

Views: 1503

Answers (1)

Nikolay Shmyrev

Reputation: 25220

Overall, DTW is not an easy thing to implement. You might check this lecture to see what must be done:

http://www.fit.vutbr.cz/~grezl/ZRE/lectures/08_reco_dtw_en.pdf

You need to figure out why the distance between 'start' and 'stop' is smaller than between two 'start's. Is it due to different volume, or did you use different voices? There could be many issues. The distance between identical samples must be 0. You might want to dump the frame-by-frame alignment between the samples to see what is aligned to what.

Ideally, DTW should not allow very big jumps between frames; the lecture above describes this.

For better accuracy, the feature extraction pipeline should include a lifter for the cepstrum and cepstral mean normalization (which is essentially volume normalization).

The audio you use should not include silence; you need voice activity detection to strip it out.

Also, I'm not sure about the sample rate of your audio, but a frame size of 1024 samples is probably too large.

Upvotes: 2

Related Questions