user4190374

Reputation: 49

Audio File Speech Recognition in Python - location of word in seconds

I've been experimenting with the Python speech recognition library https://pypi.python.org/pypi/SpeechRecognition/ to read downloaded versions of the BBC shipping forecast. The clipping of those files from live radio to the iPlayer is obviously automated and not very accurate, so there is usually some audio before the forecast itself starts: a trailer, or the end of the news. I don't need to be that accurate, but I'd like to get speech recognition to recognise the phrase "and now the shipping forecast" (or just 'shipping' would do, actually) and cut the file from there.

My code so far (adapted from an example) transcribes an audio file of the forecast and uses a formula (based on 200 words per minute, i.e. 0.3 seconds per word) to predict where the word 'shipping' comes, but it's not proving to be very accurate.

Is there a way of getting the actual 'frame' or second onset that pocketsphinx itself detected for that word? I can't find anything in the documentation. Anyone any ideas?

from os import path

import speech_recognition as sr

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "test_short2.wav")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said ")
    returnedSpeech = str(r.recognize_sphinx(audio))

    wordsList = returnedSpeech.split()
    print(returnedSpeech)
    # estimate the onset from the word index: 200 words per minute = 0.3 s per word
    print("predicted location of start ", float(wordsList.index("shipping")) * 0.3)

except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

Upvotes: 1

Views: 3169

Answers (1)

Nikolay Shmyrev

Reputation: 25210

You need to use the pocketsphinx API directly for such things. It is also highly recommended to read the pocketsphinx documentation on keyword spotting.

You can spot a keyphrase as demonstrated in this example:

import os
from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths: point these at your acoustic model and audio directories
modeldir = "model"
datadir = "."

# Configure the decoder for keyword spotting rather than full transcription
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(modeldir, 'en-us/en-us'))
config.set_string('-dict', os.path.join(modeldir, 'en-us/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'shipping forecast')
config.set_float('-kws_threshold', 1e-30)

stream = open(os.path.join(datadir, "test_short2.wav"), "rb")

decoder = Decoder(config)
decoder.start_utt()
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        # seg.start_frame and seg.end_frame locate the keyphrase in decoder frames
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected keyphrase, restarting search")
        decoder.end_utt()
        decoder.start_utt()
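
The start_frame and end_frame values are decoder frames, not seconds. As a rough sketch of how to turn them into a cut point: pocketsphinx's default frame rate (the -frate option) is, as far as I know, 100 frames per second, so dividing by 100 gives seconds. That frame rate and the assumption that start_frame counts from the beginning of the audio (true at least for the first detection) should be checked against your configuration. The helper names frames_to_seconds and trim_wav are made up for illustration; only the standard-library wave module is used.

```python
import wave

# Assumption: pocketsphinx default frame rate (-frate) of 100 frames per second
FRAME_RATE = 100


def frames_to_seconds(frame, frame_rate=FRAME_RATE):
    """Convert a decoder frame index into seconds."""
    return frame / float(frame_rate)


def trim_wav(src_path, dst_path, start_seconds):
    """Copy src_path to dst_path, dropping all audio before start_seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Convert seconds to a sample position, clamped to the file length
        start_sample = min(int(start_seconds * src.getframerate()), src.getnframes())
        src.setpos(start_sample)
        audio = src.readframes(src.getnframes() - start_sample)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(audio)  # the header's frame count is corrected on close
```

For example, if the keyphrase segment reports start_frame 1250, then trim_wav("test_short2.wav", "forecast_only.wav", frames_to_seconds(1250)) would write a copy starting 12.5 seconds in (the output filename is only an example).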

Upvotes: 1
