How to get the frequency of audio at a specific time in Python?

Question

I am working on an mp3 file and want to convert its speech to text with the speech_recognition Python module. I need the text from the mp3 file for every 10 seconds, but I am not getting accurate results. My idea is to get the frequency of the audio every 10 seconds and, if the frequency is too low, convert the audio to text up to that point. (I don't want to use numpy, scipy, or matplotlib.)

Please give your valuable suggestions.

Anil_M · Accepted Answer

To detect low frequency you would need a short-time Fourier transform (STFT) algorithm. A better approach may be to detect amplitude (loudness) and silence instead.
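If a rough frequency per time window is still wanted without numpy or scipy, counting zero crossings is a much simpler stand-in for a full STFT. The sketch below is an illustration only and is not part of the original program: the function names `estimate_frequency` and `frequency_at` and the synthetic two-tone signal are assumptions, and the method is only meaningful when one pitch dominates the window.

```python
import math

def estimate_frequency(samples, sample_rate):
    """Estimate the dominant frequency (Hz) by counting sign changes."""
    if len(samples) < 2:
        return 0.0
    crossings = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev < 0) != (cur < 0):
            crossings += 1
    duration = (len(samples) - 1) / float(sample_rate)
    # One full cycle of a tone produces two zero crossings
    return crossings / (2.0 * duration)

def frequency_at(samples, sample_rate, start_sec, length_sec):
    """Estimate the frequency of the window starting at start_sec."""
    start = int(start_sec * sample_rate)
    end = start + int(length_sec * sample_rate)
    return estimate_frequency(samples[start:end], sample_rate)

def sine(freq, sample_rate, seconds):
    """Generate a synthetic sine tone as a list of floats (test data only)."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

# Hypothetical signal: 1 s at 200 Hz followed by 1 s at 1000 Hz
rate = 8000
signal = sine(200, rate, 1.0) + sine(1000, rate, 1.0)
low = frequency_at(signal, rate, 0.0, 1.0)
high = frequency_at(signal, rate, 1.0, 1.0)
print("first second: %.1f Hz, second second: %.1f Hz" % (low, high))
```

Real speech contains many frequencies at once, so a zero-crossing count gives only a coarse indication there.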

pydub offers an easier way to measure loudness: it can report the level in dBFS, the maximum volume, and the RMS volume.

You can install pydub using
pip install pydub
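To make the dBFS figure concrete without installing anything, the sketch below computes RMS and dBFS for fixed-length windows using only the standard library. It is an illustration under assumptions, not pydub's own code: the helper names (`rms`, `dbfs`, `window_levels`), the 16-bit sample width, and the synthetic signal are all hypothetical choices.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of integer samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / float(len(samples)))

def dbfs(samples, sample_width_bytes=2):
    """Level in dB relative to full scale; -inf for pure silence."""
    full_scale = float(2 ** (8 * sample_width_bytes - 1))
    level = rms(samples)
    if level == 0:
        return float("-inf")
    return 20.0 * math.log10(level / full_scale)

def window_levels(samples, sample_rate, window_sec):
    """Return one dBFS value per consecutive window of window_sec seconds."""
    size = int(sample_rate * window_sec)
    return [dbfs(samples[i:i + size]) for i in range(0, len(samples), size)]

# Hypothetical 16-bit signal: 1 s of a loud tone, then 1 s of a quiet tone
rate = 8000
loud = [int(20000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(rate)]
quiet = [int(200 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(rate)]
levels = window_levels(loud + quiet, rate, 1.0)
print(["%.1f dBFS" % level for level in levels])
```

With a real file, the integer samples could be read with the standard `wave` module and unpacked with `struct`; a window whose level falls well below the others is a candidate pause.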

As for splitting the audio into 10-second intervals and feeding each one through the speech_recognition module, I finally got a crude program working. It has a few kinks and is by no means comprehensive, but it works as a proof of concept and gives some insight into the direction you are looking for. The program works with WAV files, but you can replace the WAV format with MP3 to get it working with MP3 files.

Setup

Basically, I downloaded free, open-source pre-recorded WAV files from the site below and concatenated them using pydub.

[https://evolution.voxeo.com/library/audio/prompts/numbers/index.jsp]

When I tested the individual files, only the Google recognizer worked, so I removed the others to keep the code clean.

The sample Python code for speech recognition was downloaded from https://github.com/Uberi/speech_recognition/blob/master/examples/wav_transcribe.py

The program uses pydub to read an audio file containing the spoken numbers 0 through 100 and slices it at 10-second intervals. Because of the nature of the pre-recorded file, and because the program does not slice dynamically, the recognition is not aligned with the slice boundaries, as the output shows (for example, 21 appears in two consecutive slices).

I believe a better program could be developed that detects silence dynamically and slices the audio accordingly.
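One way to approach this, sketched here with the standard library only, is to mark short frames as silent when their RMS falls below a threshold and to cut in the middle of each sufficiently long silent run. The function names, the 20 ms frame size, the threshold, and the minimum silence length are illustrative assumptions, not values from the program below. (pydub also ships a `pydub.silence` module with a `split_on_silence` function for this purpose.)

```python
import math

def frame_rms(frame):
    """RMS amplitude of one frame of integer samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / float(len(frame)))

def find_cut_points(samples, sample_rate, frame_ms=20,
                    threshold=500, min_silence_ms=300):
    """Return sample indexes in the middle of each interior silent run."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_silence_ms // frame_ms)
    n_frames = (len(samples) + frame_len - 1) // frame_len
    cuts = []
    run_start = None  # index of the first frame of the current silent run
    for f in range(n_frames + 1):
        silent = False
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            silent = frame_rms(frame) < threshold
        if silent:
            if run_start is None:
                run_start = f
        elif run_start is not None:
            # The run ended at frame f; skip leading and trailing silence
            long_enough = f - run_start >= min_frames
            if long_enough and run_start > 0 and f < n_frames:
                cuts.append((run_start + f) * frame_len // 2)
            run_start = None
    return cuts

def split_at(samples, cuts):
    """Split the sample list at the given cut indexes."""
    bounds = [0] + list(cuts) + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical signal: 0.5 s tone, 0.5 s silence, 0.5 s tone at 8 kHz
rate = 8000
tone = [int(10000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(4000)]
signal = tone + [0] * 4000 + tone
cuts = find_cut_points(signal, rate)
chunks = split_at(signal, cuts)
print(cuts, [len(c) for c in chunks])
```

Cutting in the middle of a pause keeps a little silence on both sides of each chunk, which avoids clipping the start or end of a word.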

This was developed on a Windows system with Python 2.7.

Program

############################### Declarations ##############################################

import os
from pydub import AudioSegment
import speech_recognition as sr



#Read main audio file to be processed. Assuming in the same folder as this script
sound = AudioSegment.from_wav("0-100.wav")

#Slice length in milliseconds (pydub indexes audio in milliseconds)
tenSecSlice = 10 * 1000

#Total Audio Length
audioLength = len(sound)

#Get quotient and remainder 
q, r = divmod(audioLength, tenSecSlice)

#Get total segments and rounds to next greater integer 
totalSegments= q + int(bool(r)) 

#Folder for the exported slices, relative to this script; created if missing
exportPath = "tempDir"
if not os.path.isdir(exportPath):
    os.makedirs(exportPath)

####################################################
#Function for Speech Recognition
#downloaded & modified from the above mentioned site
####################################################


def processAudio(WAV_FILE):
    r = sr.Recognizer()
    # sr.AudioFile is the current name; older releases of the library call it sr.WavFile
    with sr.AudioFile(WAV_FILE) as source:
        audio = r.record(source) # read the entire WAV file

    # recognize speech using Google Speech Recognition
    try:
        # for testing purposes, we're just using the default API key
        # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
        # instead of `r.recognize_google(audio)`
        print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
    except sr.UnknownValueError:
        print("Google Speech Recognition could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))

############################### Slice Audio and Process ################################

n = 0

#Iterate through slices and feed each one to the speech recognition function
while n < totalSegments:
    #Slice boundaries in milliseconds
    firstPart = tenSecSlice * n
    secondPart = tenSecSlice * (n + 1)

    print ("Making slice  from %d to %d  (sec)" % (firstPart / 1000, secondPart / 1000))
    print ("Recognizing words from  %d to %d " % (firstPart / 1000, secondPart / 1000))
    tempObject = sound[firstPart:secondPart]
    #os.path.join avoids hand-written backslashes, which broke the original string literal
    myAudioFile = os.path.join(exportPath, "slice" + str(n) + ".wav")
    tempObject.export(myAudioFile, format="wav")
    n += 1
    processAudio(myAudioFile)
    print ("")

############################### End Program ##############################################

OUTPUT

    Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32  
Type "copyright", "credits" or "license()" for more information.  
================================ RESTART ================================  

Making slice  from 0 to 10 (sec)  
 Recognizing words from  0 to 10  
Google Speech Recognition thinks you said 0 1 2 3 4 5 6 7 8 9 10 11  

Making slice  from 10 to 20 (sec)  
 Recognizing words from  10 to 20  
Google Speech Recognition thinks you said 12 13 14 15 16 17 18 19 20 21  

Making slice  from 20 to 30 (sec)  
 Recognizing words from  20 to 30  
Google Speech Recognition thinks you said 21 22 23 24 25 26 27 28 29  

Making slice  from 30 to 40 (sec)  
 Recognizing words from  30 to 40  
Google Speech Recognition thinks you said 30 31 32 33 34 35 36 37 38  

Making slice  from 40 to 50 (sec)  
 Recognizing words from  40 to 50  
Google Speech Recognition thinks you said 39 40 41 42 43 44 45 46 47  

Making slice  from 50 to 60 (sec)  
 Recognizing words from  50 to 60  
Google Speech Recognition thinks you said 48 49 50 51 52 53 54 55 56  

Making slice  from 60 to 70 (sec)  
 Recognizing words from  60 to 70  
Google Speech Recognition thinks you said 57 58 59 60 61 62 63 64 65  

Making slice  from 70 to 80 (sec)  
 Recognizing words from  70 to 80  
Google Speech Recognition thinks you said 66 67 68 69 70 71 72 73 74  

Making slice  from 80 to 90 (sec)  
 Recognizing words from  80 to 90  
Google Speech Recognition thinks you said 75 76 77 78 79 80 81 82 83  

Making slice  from 90 to 100 (sec)  
 Recognizing words from  90 to 100  
Google Speech Recognition thinks you said 84 85 86 87 88 89 90 91 92  

Making slice  from 100 to 110 (sec)  
 Recognizing words from  100 to 110  
Google Speech Recognition thinks you said 93 94 95 96 97 98 99 100
