Reputation: 405
I want to build a speech recognition engine in Ruby. I know I'll never get there; I'm doing it just for fun. I need to get the frequency data for the sound stored in a WAV file so I can compare it with data I already have for the different sounds I want to recognize. I will write the code in Ruby, but I don't think there are any Ruby libraries for this, and they would be too slow anyway. The good thing about Ruby is that I can use .NET libraries via IronRuby or Java libraries via JRuby. How can I get the frequency data?
Upvotes: 3
Views: 3379
Reputation: 1656
You should read some papers about speaker recognition. You can also find many libraries on the Internet that address this problem. To build a speaker recognition system (either an identification system or a verification system) you need:
Good audio features (you want something that uniquely describes the voice of each speaker in your data set). Most audio features are extracted from the short-term spectrum, that is, the FFT of the signal taken over small frames in which the signal is assumed to be stationary. But the spectrum itself (the log of the FFT magnitude) is never used directly as a descriptor, because it contains too much useless information. What matters most for describing someone's voice is the envelope of the spectrum. You should definitely have a look at the audio descriptor called MFCC (Mel-frequency cepstral coefficients), which is the most widely used audio feature for speaker recognition tasks.
Then you also need a good classifier (something like a GMM, an SVM, ...) because this problem is solved using supervised machine learning algorithms. Basically, you need to train a model for each speaker you want to recognize, and then test your model using data that was not used for training. The model
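The short-term analysis mentioned above starts by cutting the signal into small overlapping frames and tapering each one with a window before taking its FFT. Here is a minimal Ruby sketch of that first step only (not a full MFCC pipeline). The method name `windowed_frames`, the Hamming window, and the 25 ms frame / 10 ms hop sizes are illustrative assumptions, not something prescribed by this answer.

```ruby
# Split a signal into overlapping frames and taper each frame with a
# Hamming window. frame_length and hop are given in samples.
# (Hypothetical helper; the window choice is an assumption.)
def windowed_frames(samples, frame_length, hop)
  raise ArgumentError, "frame_length must be at least 2" if frame_length < 2
  raise ArgumentError, "hop must be positive" unless hop.positive?

  # Hamming window coefficients, computed once and reused for every frame.
  window = Array.new(frame_length) do |i|
    0.54 - 0.46 * Math.cos(2.0 * Math::PI * i / (frame_length - 1))
  end

  frames = []
  start = 0
  # Only full frames are kept; a trailing partial frame is dropped.
  while start + frame_length <= samples.length
    frame = samples[start, frame_length]
    frames << frame.zip(window).map { |sample, weight| sample * weight }
    start += hop
  end
  frames
end

# Usage: at an assumed 8000 Hz sample rate, 200 samples is 25 ms
# and 80 samples is 10 ms.
signal = Array.new(1000, 1.0)
frames = windowed_frames(signal, 200, 80)
```

Each resulting frame is short enough that the signal can be treated as roughly stationary, which is what makes a per-frame spectrum meaningful.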
Upvotes: 1
Reputation: 164291
A wave file is not too complicated; in essence it is just a series of audio samples: http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html.
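To give an idea of how little is involved, here is a rough Ruby sketch that walks the RIFF chunks and unpacks the samples using only the standard library. It assumes uncompressed 16-bit PCM and rejects everything else; `read_wav_samples` is a made-up name for illustration, and a real file may contain extra chunks or formats this does not handle.

```ruby
require "stringio"

# Read interleaved 16-bit PCM samples from a RIFF/WAVE stream.
# Returns [sample_rate, channels, samples]. (Hypothetical helper.)
def read_wav_samples(io)
  riff, _size, wave = io.read(12).unpack("a4Va4")
  raise ArgumentError, "not a RIFF/WAVE file" unless riff == "RIFF" && wave == "WAVE"

  format = nil
  # Each chunk is a 4-byte id, a 4-byte little-endian size, then the payload.
  while (header = io.read(8)) && header.bytesize == 8
    id, size = header.unpack("a4V")
    body = io.read(size) || ""
    io.read(1) if size.odd? # chunks are padded to an even length

    case id
    when "fmt "
      audio_format, channels, sample_rate, _byte_rate, _block_align, bits =
        body.unpack("vvVVvv")
      unless audio_format == 1 && bits == 16
        raise ArgumentError, "only 16-bit PCM is handled in this sketch"
      end
      format = [sample_rate, channels]
    when "data"
      raise ArgumentError, "fmt chunk missing before data" unless format
      # "s<" is a signed 16-bit little-endian integer.
      return format + [body.unpack("s<*")]
    end
  end
  raise ArgumentError, "no data chunk found"
end

# Usage: build a tiny mono WAV in memory and read it back.
samples = [0, 1000, -1000, 32767, -32768]
data = samples.pack("s<*")
fmt = [1, 1, 8000, 8000 * 2, 2, 16].pack("vvVVvv")
wav = "RIFF".b + [4 + 8 + fmt.bytesize + 8 + data.bytesize].pack("V") + "WAVE" +
      "fmt " + [fmt.bytesize].pack("V") + fmt +
      "data" + [data.bytesize].pack("V") + data
rate, channels, decoded = read_wav_samples(StringIO.new(wav))
```

For a file on disk, the same method should work with `File.open(path, "rb") { |f| read_wav_samples(f) }`.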
Once you can read the samples, the next step is to run them through an FFT (fast Fourier transform) to get the frequency content. There should be open source implementations you can use, or you could implement one yourself.
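If you do implement one yourself, a textbook recursive radix-2 Cooley-Tukey FFT fits in a few lines of Ruby. The sketch below is meant for illustration rather than speed, requires a power-of-two input length, and uses made-up names (`fft`, `dominant_frequency`). A test tone stands in for real audio.

```ruby
# Recursive radix-2 Cooley-Tukey FFT. Input length must be a power of two.
# Returns an array of Complex values, one per frequency bin.
def fft(samples)
  n = samples.length
  unless n > 0 && (n & (n - 1)).zero?
    raise ArgumentError, "length must be a power of two"
  end
  return [Complex(samples[0], 0)] if n == 1

  # Split into even-indexed and odd-indexed samples and transform each half.
  evens, odds = samples.each_slice(2).to_a.transpose
  even_part = fft(evens)
  odd_part = fft(odds)

  result = Array.new(n)
  (n / 2).times do |k|
    twiddle = Complex.polar(1.0, -2.0 * Math::PI * k / n) * odd_part[k]
    result[k] = even_part[k] + twiddle
    result[k + n / 2] = even_part[k] - twiddle
  end
  result
end

# Frequency in Hz of the strongest bin below the Nyquist frequency.
def dominant_frequency(samples, sample_rate)
  half = samples.length / 2
  magnitudes = fft(samples).first(half).map(&:abs)
  peak_bin = magnitudes.index(magnitudes.max)
  peak_bin * sample_rate.to_f / samples.length
end

# Usage: a 1000 Hz sine sampled at 8000 Hz falls exactly on bin 128 of 1024.
sample_rate = 8000
tone = Array.new(1024) { |i| Math.sin(2.0 * Math::PI * 1000 * i / sample_rate) }
peak = dominant_frequency(tone, sample_rate)
```

Bin `k` of an `n`-point transform corresponds to `k * sample_rate / n` Hz, so the frequency resolution depends on how many samples go into each transform.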
What you are trying to do requires some understanding of audio and the mathematics behind signal processing, so you may want to start with a book on the subject.
Upvotes: 3