Reputation: 355
I'm working on a project to compare how similar someone's singing is to the original artist. Mostly interested in the pitch of the voice to see if they're in tune.
The audio files are in .wav format and I've been able to load them with the wave module and convert them to Numpy arrays. Then I built a frequency and a time vector to plot the signal.
import wave
import numpy as np

raw_audio = wave.open("myAudio.WAV", "r")
audio = raw_audio.readframes(-1)
signal = np.frombuffer(audio, dtype='int16')
fs = raw_audio.getframerate()
timeDelta = 1/fs  # time between samples is one sampling period

# Get time and frequency vectors
start = 0
end = len(signal)*timeDelta
points = len(signal)
t = np.linspace(start, end, points)
f = np.linspace(0, fs, points)
If I have another signal of the same duration (they both land at approximately 5-10 seconds), what would be the best way to compare these two signals for similarity?
I've thought of comparing the frequency domains and of autocorrelation, but I feel that both of those methods have a lot of drawbacks.
Upvotes: 2
Views: 6704
Reputation: 76
I am faced with a similar problem: evaluating the similarity of two audio signals (one real, one generated by a machine-learning pipeline). I have signal parts where the comparison is very time-critical (the time difference between peaks representing the arrival of different early reflections), and for those I will try calculating the cross-correlation between the signals (more on that here: https://www.researchgate.net/post/how_to_measure_the_similarity_between_two_signal ).
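A minimal sketch of that idea, using scipy.signal.correlate on two toy arrays (the arrays, the 50-sample delay and the score formula are just illustrative assumptions):

import numpy as np
from scipy import signal

# toy stand-ins for two recordings (replace with your own sample arrays)
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = np.roll(a, 50)  # same signal, delayed by 50 samples

corr = signal.correlate(b, a, mode='full')
lag = np.argmax(corr) - (len(a) - 1)  # estimated delay of b relative to a, in samples

# peak of the normalized cross-correlation as a rough similarity score
score = corr.max() / (np.linalg.norm(a) * np.linalg.norm(b))

print(lag, score)  # lag comes out as 50 for this toy example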
Since natural recordings of two different voices will be quite different in the time domain, this kind of time-domain comparison would probably not be ideal for your problem.
For signals where frequency information (like pitch and timbre) is of greater interest, I would work in the frequency domain. You can, for example, calculate short-time FFTs (STFT) or CQTs (a more musical representation of the spectrum, as it is mapped to octaves) for the two signals and then compare them, for example by calculating the mean squared error (MSE) between corresponding time windows of the two signals. Before transforming, you should of course normalize the signals. STFT, CQT and normalization can easily be done and visualized with librosa (a small sketch follows these links);
see here: https://librosa.org/doc/latest/generated/librosa.util.normalize.html
here: https://librosa.org/doc/latest/generated/librosa.cqt.html?highlight=cqt
here: https://librosa.org/doc/latest/generated/librosa.stft.html
and here: https://librosa.org/doc/main/generated/librosa.display.specshow.html
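A minimal sketch of that pipeline for one recording, assuming librosa and matplotlib are installed (the file name and the choice of CQT for display are just examples):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# load and peak-normalize one recording (file name is hypothetical)
y, sr = librosa.load("myAudio.wav", sr=None)
y = librosa.util.normalize(y)

# short-time Fourier transform and constant-Q transform
D = librosa.stft(y)
C = librosa.cqt(y, sr=sr)

# visualize the CQT magnitude in dB
librosa.display.specshow(librosa.amplitude_to_db(np.abs(C), ref=np.max),
                         sr=sr, x_axis='time', y_axis='cqt_note')
plt.colorbar()
plt.show()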
Two things about this approach:
Don't make the time windows of your STFTs too short. Spectra of human voices start somewhere in the hundred-hertz range (https://av-info.eu/index.html?https&&&av-info.eu/audio/speech-level.html gives 350 Hz as the low end), so the number of samples in (i.e. the length of) your STFT time windows should be at least:
(1 / 350 Hz) * sampling frequency
So if your recordings have a 44100 Hz sampling frequency, your time window must be at least
(1 / 350 Hz) * 44100 Hz = 0.002857... s * 44100 samples/s = 126 samples long.
Make it 128, which is a nicer number (a power of two). That way you guarantee that a sound wave with a fundamental frequency of 350 Hz can still be "seen" for at least one full period in a single window. Of course, bigger windows will give you a more exact spectral representation; a small helper for this calculation is sketched below.
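For example, a tiny sketch of that calculation, rounding up to the next power of two (350 Hz and 44100 Hz are just the example values from above):

import math

fs = 44100   # example sampling frequency
f_min = 350  # assumed lowest fundamental frequency of interest

n_min = math.ceil(fs / f_min)             # 126 samples: one full period of f_min
n_fft = 2 ** math.ceil(math.log2(n_min))  # round up to a power of two -> 128
print(n_min, n_fft)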
Before transforming, you should make sure that the two signals you are comparing represent the same sound events at the same time. So none of this works if the two singers didn't sing the same thing, or didn't sing at the same speed, or there are different background noises in the signals. Provided that you have dry recordings of only the voices and these voices sing the same thing at the same speed, you just need to make sure that the starts of the signals align. In general, you need to make sure that sound events (e.g. transients, silence, notes) align: when there is a long AAAH sound in one signal, there should also be a long AAAH sound in the other signal. You can make your evaluation somewhat more robust by increasing the STFT windows even further; this reduces time resolution (you will get fewer spectral representations of the signals), but more sound events are evaluated together in one time window.
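One simple way to line up the starts, assuming the recordings are dry and only differ by some leading silence, is to trim that silence with librosa (the file names and the top_db threshold are just illustrative):

import librosa

# load both recordings at their native sample rates (hypothetical file names)
y1, sr1 = librosa.load("original.wav", sr=None)
y2, sr2 = librosa.load("cover.wav", sr=None)

# strip leading/trailing silence so the first sound events line up
y1_trimmed, _ = librosa.effects.trim(y1, top_db=30)
y2_trimmed, _ = librosa.effects.trim(y2, top_db=30)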
You could of course just generate one FFT for each signal over its entire length, but the results will be more meaningful if you generate STFTs or CQTs (or some other transform better suited to human hearing) over equal-length, short time windows and then calculate the MSE for each pair of time windows (the first time window of signal 1 with the first window of signal 2, then the second pair, then the third, and so on), as in the sketch below.
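A minimal sketch of that window-by-window comparison, assuming y1_trimmed and y2_trimmed from above share the same sample rate and that the FFT/hop sizes are just example values:

import numpy as np
import librosa

n_fft, hop = 2048, 512  # example window and hop sizes

# magnitude spectrograms: one column per time window
S1 = np.abs(librosa.stft(y1_trimmed, n_fft=n_fft, hop_length=hop))
S2 = np.abs(librosa.stft(y2_trimmed, n_fft=n_fft, hop_length=hop))

# compare only the windows both signals have
n = min(S1.shape[1], S2.shape[1])
mse_per_window = np.mean((S1[:, :n] - S2[:, :n]) ** 2, axis=0)

print(mse_per_window.mean())  # lower = more similar spectra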
Hope this helps.
Upvotes: 4