Mark Kronborg

Reputation: 31

Java - Questions about estimating fundamental frequency

I'm trying to estimate the fundamental frequency from a .wav file that contains a recording of a single spoken word.

What I've tried is to read the file with an AudioInputStream. The format is PCM_SIGNED 44100.0 Hz, 16 bit, stereo, 4 bytes/frame, little-endian.

Therefore I have made a new buffer to contain just one channel. This code achieves that:

double [] audioRight = new double[audioBytes.length/2]; 
for(int i = 0, k = 0; i <= audioBytes.length-1; i+=4, k+=2){
    audioRight[k]=audioBytes[i];
    audioRight[k+1]=audioBytes[i+1];
}

Then the data is moved to an fftBuffer, which is twice the size, and a DFT is applied. The library used is JTransforms, and the function used is realForwardFull.

DoubleFFT_1D fftDo= new DoubleFFT_1D(audioLeft.length);
double[] fftBuffer = new double [audioLeft.length*2];

for (int i = 0; i < audioLeft.length; i++){
     fftBuffer[i] = audioLeft[i];
}
fftDo.realForwardFull(fftBuffer);

This gives a list of complex numbers which I use to calculate the magnitude/amplitude of each complex number in order to make a power spectrum.

The formula used to get the amplitude is Amplitude = sqrt(IM*IM + RE*RE).
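A minimal sketch of this magnitude step, assuming the layout that JTransforms documents for realForwardFull: the output array holds interleaved pairs, with the real part of bin k at index 2*k and the imaginary part at index 2*k+1. The class and method names are illustrative and not part of the original code.

```java
import java.util.Arrays;

final class SpectrumMagnitude {
    /** Converts interleaved (re, im) pairs into one magnitude per bin. */
    static double[] magnitudes(double[] interleaved) {
        double[] mags = new double[interleaved.length / 2];
        for (int k = 0; k < mags.length; k++) {
            double re = interleaved[2 * k];
            double im = interleaved[2 * k + 1];
            mags[k] = Math.hypot(re, im); // sqrt(re*re + im*im)
        }
        return mags;
    }

    public static void main(String[] args) {
        // Two made-up complex bins: 3+4i and 0-2i.
        double[] spectrum = {3.0, 4.0, 0.0, -2.0};
        System.out.println(Arrays.toString(magnitudes(spectrum)));
    }
}
```

For real input, the second half of the resulting magnitude array mirrors the first half, so only the first half carries distinct information.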

This provides an array of amplitudes to which I apply the harmonic summation method: for each index, the amplitude at that index is added to the amplitudes at its next 3 harmonics (2, 3 and 4 times the index), and the index that gives the highest sum is taken to represent the fundamental frequency.

double top_sum = 0;
double first_index = 0;
double sum = 0;
double f_0 = 0;
double FR = audioInputStream.getFormat().getSampleRate()/2/ampBuffer.length;

for (int i = 50; i <= ampBuffer.length/4-1; i++){
    sum = ampBuffer[i]+ampBuffer[i*2]+ampBuffer[i*3]+ampBuffer[i*4];
    if (top_sum < sum){
        top_sum = sum;
        first_index = i;
    }
}

This index, however, needs to be mapped back to the frequency domain. To my understanding that should be done by computing (index / fftBuffer.length) * sampleRate.
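One thing worth checking in this mapping is which length goes in the denominator. For an N-point DFT of N real samples, bin k sits at k * sampleRate / N. With realForwardFull the buffer is 2N long because it stores (re, im) pairs, so dividing by the buffer length instead of N would halve the result if the index counts complex bins. A small sketch under that assumption; the names binToHz and hzToBin are hypothetical:

```java
final class BinFrequency {
    /**
     * Frequency in Hz of DFT bin k.
     * n is the number of time-domain samples that were transformed,
     * not the length of the interleaved (re, im) output buffer (2 * n).
     */
    static double binToHz(int k, int n, double sampleRate) {
        return k * sampleRate / n;
    }

    /** Nearest bin index for a frequency in Hz (inverse of binToHz). */
    static int hzToBin(double hz, int n, double sampleRate) {
        return (int) Math.round(hz * n / sampleRate);
    }

    public static void main(String[] args) {
        int n = 4096;            // assumed frame length in samples
        double sampleRate = 44100.0;
        int bin = hzToBin(300.0, n, sampleRate);
        System.out.println("bin " + bin + " is at " + binToHz(bin, n, sampleRate) + " Hz");
    }
}
```

Note also that if the index and the length are both int, index / length is an integer division in Java and yields 0 for every index below the length; multiplying by the sample rate first, as above, avoids that.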

This provides an estimate of the fundamental frequency.
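The harmonic summation search described earlier can be sketched as a standalone method. This is an illustration under assumptions, not the original implementation: mags is taken to hold one magnitude per DFT bin for the full spectrum, and minBin, maxBin and harmonics are hypothetical parameters that bound the search.

```java
final class HarmonicSum {
    /**
     * Returns the bin in [minBin, maxBin] whose first `harmonics` multiples
     * have the largest summed magnitude, or -1 if no bin qualifies.
     */
    static int bestFundamentalBin(double[] mags, int minBin, int maxBin, int harmonics) {
        // Only the first half of the spectrum of a real signal is unique.
        int usable = mags.length / 2;
        int best = -1;
        double bestSum = Double.NEGATIVE_INFINITY;
        for (int k = Math.max(1, minBin); k <= maxBin; k++) {
            if (k * harmonics >= usable) {
                break; // the highest harmonic would fall outside the usable half
            }
            double sum = 0.0;
            for (int h = 1; h <= harmonics; h++) {
                sum += mags[k * h];
            }
            if (sum > bestSum) {
                bestSum = sum;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic spectrum: a low noise floor with peaks at bins 10, 20, 30, 40.
        double[] mags = new double[256];
        java.util.Arrays.fill(mags, 0.01);
        for (int h = 1; h <= 4; h++) {
            mags[10 * h] = 1.0;
        }
        System.out.println(bestFundamentalBin(mags, 5, 60, 4));
    }
}
```

Choosing minBin and maxBin from a plausible pitch range in Hz, rather than from a fixed bin number, keeps the search from settling on very low bins that merely collect energy from higher harmonics.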

The result, however, is not "correct". I have several different .wav files to test on, and with most of them the result is way outside the expected range. For the same female voice, three different words give the results 40, 13 and 360. All three results are expected to be in the range 250 to 350, approximately.

I think one of the issues causing this is the values in the amplitude buffer. When plotted, the graph doesn't show any clear peaks that represent the harmonics.

Here's an image of the graph:

[Image: Amplitudes]

I know this was a lot of information, but I believe more information makes it easier to understand what has been done.

RECAP: What I am unsure of is the amplitude data. Do the values make sense? Are they plotted correctly? Do I need to do something with the data before I search it for the harmonics and find the fundamental frequency?

I have considered applying some kind of windowing, because I suspect that leakage might be why the peaks the plot does have aren't harmonics of each other.
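For reference, a common choice for this is the Hann window, applied to each frame before the transform. The sketch below uses the symmetric form, w[i] = 0.5 * (1 - cos(2*pi*i / (n - 1))); the class name is illustrative.

```java
final class HannWindow {
    /** Returns a copy of frame multiplied by a symmetric Hann window of the same length. */
    static double[] apply(double[] frame) {
        int n = frame.length;
        double[] out = new double[n];
        if (n < 2) {
            // A window of length 0 or 1 is not meaningful; return an unchanged copy.
            System.arraycopy(frame, 0, out, 0, n);
            return out;
        }
        for (int i = 0; i < n; i++) {
            double w = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
            out[i] = frame[i] * w;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] ones = {1.0, 1.0, 1.0, 1.0, 1.0};
        System.out.println(java.util.Arrays.toString(apply(ones)));
    }
}
```

The window tapers the frame to zero at both ends, which reduces the leakage caused by the abrupt cut at the frame boundaries at the cost of slightly wider peaks.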

Any help or suggestions would be appreciated. In advance, thank you for your help!

EDIT: An attempt at what was suggested:

ByteBuffer buf = ByteBuffer.wrap(audioBytes);
buf.order(ByteOrder.LITTLE_ENDIAN);
double[] audio = new double[audioBytes.length/2];

for (int i = 0; i < audioBytes.length/2; i++) {
    short s = buf.getShort();
    double mono = (double) s;
    double mono_norm = mono / 32768.0;

    audio[i] = mono_norm;
}

Now one channel of the PCM data should be saved in the array audio[].
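As far as the snippet shows, the loop reads every 16-bit sample in the buffer, so for a stereo stream audio[] would hold the left and right samples interleaved rather than a single channel. A hedged sketch of picking out one channel, assuming 16-bit little-endian interleaved PCM as described in the format above; the class name and the channels and channel parameters are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmChannel {
    /**
     * Decodes one channel from 16-bit little-endian interleaved PCM.
     * channels is the channel count (2 for stereo); channel is 0-based.
     */
    static double[] decode(byte[] audioBytes, int channels, int channel) {
        int bytesPerFrame = 2 * channels;
        int frames = audioBytes.length / bytesPerFrame;
        ByteBuffer buf = ByteBuffer.wrap(audioBytes).order(ByteOrder.LITTLE_ENDIAN);
        double[] out = new double[frames];
        for (int f = 0; f < frames; f++) {
            // Absolute read: skip to this frame, then to the requested channel.
            short s = buf.getShort(f * bytesPerFrame + 2 * channel);
            out[f] = s / 32768.0; // scale to [-1, 1)
        }
        return out;
    }

    public static void main(String[] args) {
        // Two stereo frames: (1, -1) and (16384, -32768), little-endian.
        byte[] bytes = {0x01, 0x00, (byte) 0xFF, (byte) 0xFF,
                        0x00, 0x40, 0x00, (byte) 0x80};
        System.out.println(java.util.Arrays.toString(decode(bytes, 2, 0)));
    }
}
```

In stereo WAV data the first sample of each frame is conventionally the left channel, so channel 0 selects left and channel 1 selects right.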

Upvotes: 3

Views: 809

Answers (1)

DrKoch

Reputation: 9772

Some general hints:

You say you are trying to estimate the fundamental frequency of one spoken word. A "word" consists of several consonants and vowels (or, better, phonemes). Each of the "vowels" will have a different fundamental frequency, and in most cases the frequency will even change within one vowel (which generates the "melody" of our sentences). This means you should estimate the fundamental frequency / pitch over a very short interval of the speech and make sure you are looking at a vowel (consonants are largely a form of noise and lack clear cyclic components).

So the first step should be to generate a spectrogram of your word.

Then you may calculate short-term FFTs of the interesting parts and proceed with harmonic summation.
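The short-term analysis starts by cutting the signal into short, overlapping frames, each of which is then transformed on its own. A minimal sketch of that framing step; frameSize and hop are hypothetical parameters (for example, 2048 samples at 44.1 kHz is roughly 46 ms), not values taken from the question:

```java
import java.util.Arrays;

final class Framing {
    /** Splits samples into frames of frameSize samples, advancing by hop samples each time. */
    static double[][] frames(double[] samples, int frameSize, int hop) {
        if (frameSize <= 0 || hop <= 0 || samples.length < frameSize) {
            return new double[0][];
        }
        int count = 1 + (samples.length - frameSize) / hop;
        double[][] out = new double[count][];
        for (int i = 0; i < count; i++) {
            int start = i * hop;
            out[i] = Arrays.copyOfRange(samples, start, start + frameSize);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] samples = new double[10];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i;
        }
        for (double[] frame : frames(samples, 4, 2)) {
            System.out.println(Arrays.toString(frame));
        }
    }
}
```

Trailing samples that do not fill a whole frame are dropped here; padding the last frame with zeros would be an equally reasonable choice.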

You will, however, get better results with a short-term autocorrelation function.
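A rough sketch of what a short-term autocorrelation pitch estimate could look like for a single frame. It is a plain time-domain version under assumptions, not a tuned detector: minHz and maxHz are hypothetical bounds on the expected pitch, and the lag with the largest raw autocorrelation sum inside that range is taken as the period.

```java
final class AutocorrPitch {
    /**
     * Estimates the pitch in Hz of one frame by picking the lag with the
     * largest autocorrelation sum in [sampleRate / maxHz, sampleRate / minHz].
     * Returns -1 if the frame is too short to cover that lag range.
     */
    static double estimateHz(double[] frame, double sampleRate, double minHz, double maxHz) {
        int minLag = Math.max(1, (int) Math.floor(sampleRate / maxHz));
        int maxLag = (int) Math.ceil(sampleRate / minHz);
        if (maxLag >= frame.length) {
            return -1.0;
        }
        int bestLag = minLag;
        double best = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag; lag++) {
            // Raw (unnormalised) sum: longer lags have fewer terms, which
            // mildly favours the shortest matching period over its multiples.
            double sum = 0.0;
            for (int i = 0; i + lag < frame.length; i++) {
                sum += frame[i] * frame[i + lag];
            }
            if (sum > best) {
                best = sum;
                bestLag = lag;
            }
        }
        return sampleRate / bestLag;
    }

    public static void main(String[] args) {
        double sampleRate = 44100.0;
        double[] frame = new double[2048];
        for (int i = 0; i < frame.length; i++) {
            frame[i] = Math.sin(2.0 * Math.PI * 300.0 * i / sampleRate); // 300 Hz test tone
        }
        System.out.println(estimateHz(frame, sampleRate, 100.0, 500.0));
    }
}
```

The resolution is limited to whole-sample lags, so the estimate moves in steps of a few Hz around 300 Hz at this sample rate; interpolating around the best lag is a common refinement.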

Other things to research: Pitch-Detection, Cepstrum

Upvotes: 1
