Audio File Speech Recognition in Python - location of word in seconds

Question

I've been experimenting with the Python speech recognition library https://pypi.python.org/pypi/SpeechRecognition/ to read downloaded versions of the BBC shipping forecast. The clipping of those files from live radio to the iPlayer is obviously automated and not very accurate, so there is usually some audio before the forecast itself starts: a trailer, or the end of the news. I don't need to be that accurate, but I'd like to get speech recognition to recognise the phrase "and now the shipping forecast" (or just 'shipping' would do, actually) and cut the file from there.

My code so far (adapted from an example) transcribes an audio file of the forecast and uses a formula (based on 200 words per minute, i.e. 0.3 seconds per word) to predict where the word 'shipping' comes, but it's not proving to be very accurate.

Is there a way of getting the actual 'frame' or onset time in seconds that pocketsphinx itself detected for that word? I can't find anything in the documentation. Does anyone have any ideas?

from os import path

import speech_recognition as sr

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "test_short2.wav")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said ")
    returnedSpeech = str(r.recognize_sphinx(audio))

    wordsList = returnedSpeech.split()
    print(returnedSpeech)
    # 200 words per minute works out to 0.3 seconds per word
    predictedStart = float(wordsList.index("shipping")) * 0.3
    print("predicted location of start {0}".format(predictedStart))

except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))
except ValueError:
    # list.index raises ValueError when the word is not in the transcript
    print("'shipping' was not found in the transcript")
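
For the cutting step itself, once an offset in seconds is known (whether it comes from the rough estimate above or from something more precise), a minimal sketch using only the standard-library wave module might look like the following. The function name trim_wav and its arguments are hypothetical, and it assumes an uncompressed PCM WAV file like the one used above.

```python
import wave


def trim_wav(src_path, dst_path, start_seconds):
    """Copy src_path to dst_path, dropping the audio before start_seconds.

    Hypothetical helper for illustration; assumes a PCM WAV file.
    Returns the index of the first frame that was kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total_frames = src.getnframes()
        # Convert seconds to a frame index and clamp it to the file length.
        start_frame = int(start_seconds * src.getframerate())
        start_frame = min(max(start_frame, 0), total_frames)
        src.setpos(start_frame)
        frames = src.readframes(total_frames - start_frame)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the output file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return start_frame
```

With the estimate from the script above, this could be called as trim_wav(AUDIO_FILE, "forecast_only.wav", predictedStart), where "forecast_only.wav" is just a placeholder output name. The cut is only as good as the offset passed in, so it does not fix the accuracy problem on its own.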
