Reputation: 7562

methods for estimating SNR of an audio file?

How do I estimate the SNR from a single audio file containing speech? I know of two methods:

1. Log power histogram percentile difference (aka "NIST quick method"), described here: http://labrosa.ee.columbia.edu/~dpwe/tmp/nist/doc/stnr.txt
2. 10*log10( (S-N)/N ), where
   - S = sum{x[i]^2 * e[i]}
   - N = sum{x[i]^2 * (1-e[i])}
   - e[i] is some sort of voice activity detection (speech/non-speech indicator)

Are there any better methods that do not require stereo data (or the data in both a clean and a noisy version)? I would also like to avoid the "second method" described in the NIST document (see 1.), which makes strong assumptions about the distributions.
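For reference, the VAD-based formula 10*log10( (S-N)/N ) could be sketched in Python roughly as follows. This is only an illustration under assumptions that are not part of the question: the function name `vad_snr_db`, the frame length, and the crude energy-threshold rule standing in for e[i] are all hypothetical, and S and N are taken as mean frame powers (rather than plain sums) so that unequal amounts of speech and pause do not bias the ratio.

```python
import numpy as np

def vad_snr_db(x, frame_len=400, vad_threshold_db=-20.0):
    """Estimate SNR in dB as 10*log10((S-N)/N) from a mono signal x.

    Assumption: e[i] is a placeholder energy-based VAD, not a real one.
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames < 2:
        raise ValueError("need at least two full frames")

    # Split into non-overlapping frames and compute the mean power of each.
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.mean(frames ** 2, axis=1)

    # Placeholder VAD: a frame counts as speech if its power lies within
    # vad_threshold_db of the loudest frame. Swap in a proper VAD here.
    e = power > power.max() * 10.0 ** (vad_threshold_db / 10.0)
    if e.all() or not e.any():
        raise ValueError("VAD found only speech or only non-speech")

    s = power[e].mean()   # speech-plus-noise power (S, normalised)
    n = power[~e].mean()  # noise-only power (N, normalised)
    if n <= 0.0 or s <= n:
        raise ValueError("cannot form a positive (S-N)/N ratio")
    return 10.0 * np.log10((s - n) / n)
```

The quality of the result depends almost entirely on how good the speech/non-speech decision is, which is why a real VAD would replace the threshold rule in practice.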

Upvotes: 3

Answers (1)

heyo

Reputation: 79

Most of the energy that matters for understanding speech lies roughly between 300 Hz and 3 kHz; this is the band that (old) telephone systems transmit. Speech never occupies all of these frequencies at the same time, so at any moment the weakest bin in this band contains essentially only noise. This is why a frequency analysis can find the noise floor without any reference signal or voice activity detection e[i]:

1. Compute an FFT with a frequency resolution of roughly 10 to 20 Hz. With a sample rate of 48 kHz and a target resolution of 10 Hz you would need samplerate/resolution = 4800 samples, which should then be rounded to the nearest power of 2, i.e. 4096 (giving a resolution of about 11.7 Hz).
2. Identify the bins that hold the results for 300 to 3000 Hz. Bin index k holds the result for frequency k*samplerate/FFT_length. For the 48 kHz input and FFT length 4096 above, this gives k(300 Hz) = 300 * 4096 / 48000 ~= 26 and k(3000 Hz) = 3000 * 4096 / 48000 = 256.
3. Calculate the energy in each of these bins: E[k] = FFT[k].re^2 + FFT[k].im^2. Where the real and imaginary parts are stored depends on your FFT implementation.
4. N = min{ E[k=26..256] } * number_of_bins (= 256-26+1 = 231)
5. S = sum{ E[k=26..256] }
6. SNR = (S-N)/N. The level in dB is 10*log10(SNR).
7. As the SNR varies over time, go back to step 1 with new samples, probably with some overlap.
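The steps above could be sketched in Python with NumPy roughly like this. The function names (`band_bins`, `band_snr_db`, `snr_track_db`), the 50 % hop and the use of `np.fft.rfft` are my own choices for illustration, not part of the recipe itself, and no window function is applied because the steps do not mention one.

```python
import numpy as np

def band_bins(samplerate, n_fft, f_lo=300.0, f_hi=3000.0):
    """Step 2: bin index k corresponds to frequency k*samplerate/n_fft."""
    k_lo = int(round(f_lo * n_fft / samplerate))
    k_hi = int(round(f_hi * n_fft / samplerate))
    return k_lo, k_hi

def band_snr_db(frame, samplerate, f_lo=300.0, f_hi=3000.0):
    """Steps 1 and 3 to 6 for one block of samples; returns the SNR in dB."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    spectrum = np.fft.rfft(frame)                      # step 1
    k_lo, k_hi = band_bins(samplerate, n_fft, f_lo, f_hi)
    energy = spectrum.real ** 2 + spectrum.imag ** 2   # step 3
    band = energy[k_lo : k_hi + 1]
    n = band.min() * band.size                         # step 4: noise floor
    s = band.sum()                                     # step 5: total energy
    if n <= 0.0 or s <= n:
        return float("nan")                            # ratio undefined
    return 10.0 * np.log10((s - n) / n)                # step 6

def snr_track_db(x, samplerate, n_fft=4096, hop=2048):
    """Step 7: repeat over the signal with overlapping blocks."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([band_snr_db(x[i : i + n_fft], samplerate) for i in starts])
```

Note that N rests on a single minimum bin per block, so the per-block values fluctuate quite a bit; averaging or taking the median of `snr_track_db` over several blocks is probably more useful than any single value.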

Upvotes: 8
