Reputation: 11384
I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation.
I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?
Upvotes: 31
Views: 21437
Reputation: 5245
I know this question is out of date, but...
To solve a similar problem, I used the Google Speech Recognition API to check WHAT was said, and visually compared scaled waveforms of the volume changes to detect differences in rhythm.
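A rough sketch of the waveform-comparison half of that approach (the speech-recognition check is left out; the file names and frame size are just placeholders):

import numpy as np
from scipy.io import wavfile

def volume_envelope(path, frame_ms=20):
    # Read mono audio and compute a per-frame RMS volume envelope.
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)
    frame = int(rate * frame_ms / 1000)
    n = len(data) // frame
    rms = np.array([np.sqrt(np.mean(data[i*frame:(i+1)*frame].astype(float) ** 2))
                    for i in range(n)])
    return rms / max(rms.max(), 1e-10)        # scale to 0..1

ref = volume_envelope("reference.wav")
usr = volume_envelope("user.wav")
# Stretch the user's envelope to the reference length before comparing rhythm.
usr = np.interp(np.linspace(0, 1, len(ref)), np.linspace(0, 1, len(usr)), usr)
rhythm_difference = np.mean(np.abs(ref - usr))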
Upvotes: 4
Reputation: 21934
A lot of people seem to be suggesting some sort of edit distance, which IMO is a totally wrong approach for determining the similarity of two speech patterns, especially for patterns as short as OP is implying. The specific algorithms used in speech recognition are in fact nearly the opposite of what you would like to use here. The problem in speech recognition is resolving many similar pronunciations to the same representation. The problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.
I've done quite a bit of this stuff for large scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and will give you the power and flexibility that you want for this approach.
Firstly: I'm assuming that what you have is some chunk of audio without any filtering done on it, just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that will work well without being incredibly difficult to implement.
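For example, a basic spectral-subtraction sketch in Python (assuming the first quarter second of the clip is speech-free noise) might look like:

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft

def reduce_noise(path, noise_seconds=0.25):
    # Crude spectral subtraction: estimate the noise spectrum from the
    # (assumed speech-free) start of the clip and subtract it everywhere.
    rate, data = wavfile.read(path)
    if data.ndim > 1:                      # mix stereo down to mono
        data = data.mean(axis=1)
    freqs, times, Z = stft(data.astype(float), fs=rate)
    noise_frames = max(1, np.searchsorted(times, noise_seconds))
    noise_mag = np.abs(Z[:, :noise_frames]).mean(axis=1, keepdims=True)
    clean_mag = np.maximum(np.abs(Z) - noise_mag, 0.0)
    _, cleaned = istft(clean_mag * np.exp(1j * np.angle(Z)), fs=rate)
    return rate, cleaned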
Secondly: You're going to want to come up with a distance metric between two speech patterns. There are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.
Generate a spectrogram of the audio file in question. The output from this is ultimately going to be an image that can be represented as a 2-d array of frequency response values. A spectrogram is essentially a Fourier transform over time where the colour corresponds to intensity.
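A sketch of that step with scipy (the file name is a placeholder):

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, data = wavfile.read("speech.wav")
if data.ndim > 1:
    data = data.mean(axis=1)
freqs, times, spec = spectrogram(data.astype(float), fs=rate)
# spec is the 2-d array of frequency response over time; working in dB makes the
# intensities behave more like what you see in a plotted spectrogram.
spec_db = 10 * np.log10(spec + 1e-10)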
Use OpenCV (it has Python bindings) to run blob detection on your spectrogram. Effectively this is going to look for the big colorful blob in the middle of your spectrogram and give you some limits on it. What this should do is return a significantly sparser version of the original 2-d array that solely represents the speech in question (with the assumption that your audio file will have some trailing stuff on the front and back ends of the recording).
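A rough version of that cropping step with OpenCV's Python bindings might be (this uses a simple Otsu threshold plus a bounding box rather than a full blob detector; spec_db is the array from the previous sketch):

import cv2
import numpy as np

# Scale the spectrogram to an 8-bit image so OpenCV can work with it.
img = cv2.normalize(spec_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
# Threshold away the quiet regions, then take the bounding box of what is left.
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(np.vstack(contours))
blob = spec_db[y:y + h, x:x + w]              # the speech region only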
Normalize the two blobs to account for differences in speech speed. Everyone talks at a different speed, and as such your blobs will probably have different sizes along the x-axis (time). Without this step, your comparison would end up penalizing differences in speaking speed, which you probably don't want. This step isn't needed if you also want to make sure that they speak at the same speed as the master copy, but I would suggest it. Basically you want to stretch out the shorter version by multiplying its time axis by some constant that's just the ratio of the lengths of your two blobs.
You should also normalize the two blobs based on maximum and minimum intensity to account for people who talk at different volumes. Again, this is up to your discretion, but to fix this you should find similar ratios for the total span of intensities that you have, as well as the two recordings' max intensities, and make sure that these two values match up between your 2-d arrays.
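Both normalization steps might look roughly like this, assuming blob_user and blob_ref are the cropped arrays from the previous step for the two recordings:

import cv2
import numpy as np

def normalize_blob(blob, target_shape):
    # Stretch the time (and frequency) axes to a common shape, then rescale
    # intensities to 0..1 so overall volume differences drop out.
    resized = cv2.resize(blob.astype(np.float32),
                         (target_shape[1], target_shape[0]))  # (width, height)
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-10)

shape = (max(blob_user.shape[0], blob_ref.shape[0]),
         max(blob_user.shape[1], blob_ref.shape[1]))
a = normalize_blob(blob_user, shape)
b = normalize_blob(blob_ref, shape)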
Third: Now that you have 2-d arrays representing your two speech events, which should in theory contain all of their useful information, it's time to directly compare them. Luckily, comparing two matrices is a well-solved problem and there are a number of ways to move forward.
Personally I would suggest using a metric like Cosine Similarity to determine the difference between your two blobs, but that's not the only solution and while it'll give you a quick validation, you can do better.
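For example, with the normalized blobs a and b from above:

import numpy as np

def cosine_similarity(a, b):
    # Treat each normalized blob as one long vector and compare their directions.
    x, y = a.ravel(), b.ravel()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-10))

similarity = cosine_similarity(a, b)          # 1.0 means identical direction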
You could try subtracting one matrix from the other and get an evaluation of how much difference there is between them, which would probably be more accurate than simple cosine distance.
It might be overkill, but you could assume that there are certain regions of speech that matter more or less for evaluating difference between blobs (it might not matter if someone uses a long i instead of a short i, but a g instead of a k could be a different word entirely). For something like that you'd want to develop a mask for the difference array in the previous step and multiply all your values by that.
Whichever method you choose, you can now simply set some difference threshold and make sure that the difference between the two blobs is below your desired threshold. If it is, the captured speech is similar enough to be correct. Otherwise have them try again.
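Putting the subtraction, the optional weighting mask and the threshold together might look roughly like this (the uniform weights and the 0.15 cutoff are made-up values you would have to tune):

import numpy as np

diff = np.abs(a - b)                 # element-wise difference between the blobs
weights = np.ones_like(diff)         # optional: up-weight regions that matter more
mean_difference = np.mean(diff * weights)

THRESHOLD = 0.15                     # tune on real recordings
if mean_difference <= THRESHOLD:
    print("Close enough - pronunciation accepted")
else:
    print("Too different - please try again")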
I hope that's helpful, and again, I can't assure you that this is the exact algorithm that a company uses since that information is hugely proprietary and not open to the public, but I can assure you that methods similar to these are used in the very best papers in academia and that these methods will get you a great balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!
Upvotes: 87
Reputation: 424
You can use musicg (https://code.google.com/p/musicg/) as roy zhang suggested. On Android, just include the musicg jar file in your project and use it. A tested example:
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;

// somewhere in your code add
String file1 = Environment.getExternalStorageDirectory().getAbsolutePath();
file1 += "/test.wav";
String file2 = Environment.getExternalStorageDirectory().getAbsolutePath();
file2 += "/test.wav";   // in practice, point file2 at a different recording

// Fingerprint both files and compare the fingerprints.
Wave w1 = new Wave(file1);
Wave w2 = new Wave(file2);
FingerprintSimilarity fps = w1.getFingerprintSimilarity(w2);
float score = fps.getScore();
float sim = fps.getSimilarity();

Log.d("score", score + "");
Log.d("similarities", sim + "");
Good luck
Upvotes: 1
Reputation: 5355
You have to look into speech recognition algorithms. I understand that you don't need to translate speech to text (that is done by speech recognition algorithms), however, in your case many algorithms would be the same.
HMMs (hidden Markov models) would probably be helpful here. Also have a look at HTK: http://htk.eng.cam.ac.uk/
Upvotes: 0
Reputation: 5052
If this is only to check the pronunciation (even with a different accent), you can do this:
Step 1: Using some speech-to-text tool (say, Dragon Dictation), get the text of what the user said.
Step 2: Compare the resulting string with the string that was actually meant to be pronounced.
Step 3: If there is any discrepancy between the strings, the word was not pronounced correctly, and you can suggest the correct pronunciation. A rough sketch of this is below.
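This sketch uses the SpeechRecognition package for step 1 (any speech-to-text service would do) and a simple string comparison for steps 2 and 3; the file name and expected word are placeholders:

import difflib
import speech_recognition as sr   # the SpeechRecognition package, one option among many

expected = "pronunciation"        # what the user was supposed to say

recognizer = sr.Recognizer()
with sr.AudioFile("user_recording.wav") as source:
    audio = recognizer.record(source)
heard = recognizer.recognize_google(audio).lower()   # step 1: speech to text

# Steps 2 and 3: compare the recognized string with the expected one.
ratio = difflib.SequenceMatcher(None, expected, heard).ratio()
if heard == expected:
    print("Pronounced correctly")
else:
    print(f"Heard '{heard}' (similarity {ratio:.2f}); the correct word is '{expected}'")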
Upvotes: 0
Reputation: 8836
A carefully configured Levenshtein distance should do the trick.
Upvotes: 2
Reputation: 2334
Idea: The way biotechnologists align two DNA or protein sequences is as follows: each sequence is represented as a string over a small alphabet (e.g. A/C/G/T for DNA bases; the biology is irrelevant for us), where each letter (here, an entry) represents one unit of the sequence. The quality of an alignment (its score) is calculated from the similarity of each pair of corresponding entries, and the number and length of the blank entries (gaps) that need to be inserted to produce that alignment.
The same algorithm (http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm) can be used for pronunciations, with substitution costs derived from substitution frequencies in a set of alternate pronunciations. You can then calculate alignment scores to measure the similarity between two pronunciations in a way that is sensitive to the differences between phonemes. Measures of similarity that can be used here are Levenshtein distance, phoneme error rate, and word error rate.
Algorithms:
The minimum number of insertions, deletions and substitutions required for transformation of one sequence into another is the Levenshtein distance. More info at http://php.net/manual/en/function.levenshtein.php
Phoneme error rate (PER) is the Levenshtein distance between a predicted pronunciation and the reference pronunciation, divided by the number of phonemes in the reference pronunciation.
Word error rate (WER) is the proportion of predicted pronunciations with at least one phoneme error to the total number of pronunciations.
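As a concrete sketch, Levenshtein distance over phoneme sequences and the PER defined above (the phoneme strings are just made-up examples):

def levenshtein(a, b):
    # Minimum number of insertions, deletions and substitutions to turn a into b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

reference = ["HH", "EH", "L", "OW"]        # "hello" in ARPAbet-style phonemes
predicted = ["HH", "AH", "L", "OW"]
per = levenshtein(reference, predicted) / len(reference)   # phoneme error rate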
Source: Did an Internship on this at UW-Madison
Upvotes: 6
Reputation: 191
The musicg API (https://code.google.com/p/musicg/) has an audio fingerprint generator and scorer, along with source code to show how it's done.
I think it looks for the most similar point in each track, then scores based on how far it can match.
It might look something like
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;

// Fingerprint both recordings and compare them, as in the Android example above.
Wave voice1 = new Wave("voice1.wav");
Wave voice2 = new Wave("voice2.wav");
FingerprintSimilarity similarity = voice1.getFingerprintSimilarity(voice2);
float score = similarity.getSimilarity();
Upvotes: 6