Tomek Cejner

Reputation: 1222

SFSpeechRecognizer - detect end of utterance

I am hacking on a little project using the speech recognition built into iOS 10. I have working results using the device's microphone; my speech is recognized very accurately.

My problem is that the recognition task callback is called for every available partial transcription, and I want it to detect when the person has stopped talking and then call the callback with the isFinal property set to true. That is not happening - the app keeps listening indefinitely.

Is SFSpeechRecognizer capable of detecting the end of a sentence at all?

Here's my code - it is based on an example found on the Internet and is mostly the boilerplate needed to recognize from the microphone source. I modified it by adding a recognition taskHint. I also set shouldReportPartialResults to false, but it seems to have been ignored.

func startRecording() {

    if recognitionTask != nil {
        recognitionTask?.cancel()
        recognitionTask = nil
    }

    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(AVAudioSessionCategoryRecord)
        try audioSession.setMode(AVAudioSessionModeMeasurement)
        try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
    } catch {
        print("audioSession properties weren't set because of an error.")
    }

    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    recognitionRequest?.shouldReportPartialResults = false
    recognitionRequest?.taskHint = .search

    guard let inputNode = audioEngine.inputNode else {
        fatalError("Audio engine has no input node")
    }

    guard let recognitionRequest = recognitionRequest else {
        fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
    }

    recognitionRequest.shouldReportPartialResults = true

    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in

        var isFinal = false

        if result != nil {
            print("RECOGNIZED \(result?.bestTranscription.formattedString)")
            self.transcriptLabel.text = result?.bestTranscription.formattedString
            isFinal = (result?.isFinal)!
        }

        if error != nil || isFinal {
            self.state = .Idle

            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)

            self.recognitionRequest = nil
            self.recognitionTask = nil

            self.micButton.isEnabled = true

            self.say(text: "OK. Let me see.")
        }
    })

    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()

    do {
        try audioEngine.start()
    } catch {
        print("audioEngine couldn't start because of an error.")
    }

    transcriptLabel.text = "Say something, I'm listening!"

    state = .Listening
}

Upvotes: 37

Views: 15393

Answers (8)

fumoboy007

Reputation: 5553

The existing isFinal property is misleadingly named: it does not tell you whether the result is final but whether the stream of results has ended. It should have been named something like isEndOfStream.

To check whether the result is final, you can check whether the speechRecognitionMetadata property is nil. If it is nil, the result is partial; otherwise, the result is final.

extension SFSpeechRecognitionResult {
  var isPartialResult: Bool {
    return !isFinal && speechRecognitionMetadata == nil
  }
}
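
For context, here is a minimal sketch of how that extension might be used inside a result handler; speechRecognizer, recognitionRequest and transcriptLabel are assumptions borrowed from the question's code, not part of this answer:

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
    guard let result = result else { return }

    if result.isPartialResult {
        // Still guessing; update the UI if you like.
        self.transcriptLabel.text = result.bestTranscription.formattedString
    } else {
        // The recognizer considers this transcription final.
        print("FINAL: \(result.bestTranscription.formattedString)")
    }
}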

Important: Not sure if this is a bug, but in iOS 17.5.1, the server-side speech recognition implementation only returns a single final result, and only when endAudio is called; it doesn't seem to support the streaming case very well. Therefore, I recommend avoiding the server-side implementation by setting requiresOnDeviceRecognition to true.
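
If you go that route, here is a minimal sketch, assuming the recognizer and request objects from the question's code (supportsOnDeviceRecognition and requiresOnDeviceRecognition are iOS 13+ API):

if speechRecognizer?.supportsOnDeviceRecognition == true {
    // Keep recognition fully on-device to avoid the server-side behaviour described above.
    recognitionRequest?.requiresOnDeviceRecognition = true
}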

Upvotes: 1

devdchaudhary

Reputation: 718

This is my version of how I solved this in SwiftUI: simply check for any changes to self.recognizedText, which is set in recognitionTask via self.recognizedText = result.bestTranscription.formattedString.

speechHandler.isRecording is a property that gets set to true when recording starts and to false when it stops.

Inside our view, if there has been no change in recognizedText for 2 seconds, we can safely say that the user has stopped speaking.

.onChange(of: speechHandler.recognizedText) { text in
    if speechHandler.isRecording {
        DispatchQueue.main.asyncAfter(deadline: .now() + 2) {
            if text == speechHandler.recognizedText {
                finishedSpeaking = true
            }
        }
    }
}

Upvotes: 0

You can find my solution here, and it works quite nicely for me: https://github.com/elviin/gestureai/blob/main/GestureAI/Speech/Speech2Text.swift

First, detect possible sentences with the following extension:

extension SFTranscription {
    private static let intervalBetweenSentences: TimeInterval = 1.0

    var newSentenceStarted: Bool {
        let processed = Array(self.segments.reversed())
        guard processed.count > 1 else {
            return false
        }

        // print(processed.map { $0.substring }.joined(separator: " "))
        let last = processed[0]
        let previous = processed[1]

        let pause = last.timestamp - (previous.timestamp + previous.duration)
        if pause >= Self.intervalBetweenSentences, last.confidence > 0.0 {
            return true
        }
        return false
    }

    var lastClosedSentence: String {
        let processed = Array(self.segments.reversed())
        var wordsInSentenceReversed: [String] = []
        guard processed.count > 1 else {
            return ""
        }

        for (index, segment) in processed.enumerated() {
            wordsInSentenceReversed.append(segment.substring)

            let isFirstSegment = index == processed.count - 1
            var pause: TimeInterval = 0.0
            if isFirstSegment == false {
                let previousIndex = index + 1 // we are in the reversed array
                let previousSegment = processed[previousIndex]
                pause = segment.timestamp - previousSegment.timestamp
            }

            // Once you come to a pause, stop searching for older words.
            if (pause >= Self.intervalBetweenSentences && segment.confidence > 0.0) || isFirstSegment {
                break
            }
        }

        return wordsInSentenceReversed.reversed().joined(separator: " ")
    }
}

Second, use the new properties in the recogniser handler:

nonisolated private func recognitionHandler(audioEngine: AVAudioEngine, result: SFSpeechRecognitionResult?, error: Error?) {
    let receivedError = error != nil

    if receivedError {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
    }

    if let result, result.bestTranscription.newSentenceStarted {
        Task { @MainActor in
            transcribe(result.bestTranscription.lastClosedSentence)
        }
    }
}

lastClosedSentence contains the segments that likely form a sentence together. I am using it together with LLM APIs, and it works quite well, since those APIs need meaningful chunks of an utterance.

Upvotes: 1

Wesley

Reputation: 5621

I have a different approach that I find far more reliable in determining when the recognitionTask is done guessing: the confidence score.

When shouldReportPartialResults is set to true, the partial results will have a confidence score of 0.0. Only the final guess will come back with a score over 0.

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in

    if let result = result {
        let confidence = result.bestTranscription.segments[0].confidence
        print(confidence)
        self.transcript = result.bestTranscription.formattedString
    }

}

The segments array above contains each word in the transcription. 0 is the safest index to examine, so I tend to use that one.

How you use it is up to you, but if all you want to do is know when the guesser is done guessing, you can just call:

let myIsFinal = confidence > 0.0

You can also look at the score (1.0 is totally confident) and bucket responses from low- to high-confidence guesses if that helps your application.
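
For example, here is a rough sketch of that bucketing, placed inside the if let result block of the handler above; the averaging and the thresholds are my own illustration, not part of this answer:

let segments = result.bestTranscription.segments
let averageConfidence = segments.isEmpty
    ? 0.0
    : segments.map { Double($0.confidence) }.reduce(0, +) / Double(segments.count)

switch averageConfidence {
case 0.0:
    print("Partial result - the recognizer is still guessing.")
case ..<0.5:
    print("Final guess with low confidence.")
default:
    print("Final guess with high confidence.")
}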

Upvotes: 7

fromlucknow

Reputation: 31

// Inside the recognition result handler: restart the silence timer on every new result.
if result != nil {
    self.timerDidFinishTalk.invalidate()
    self.timerDidFinishTalk = Timer.scheduledTimer(timeInterval: TimeInterval(self.listeningTime), target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)

    let bestString = result?.bestTranscription.formattedString

    self.fullsTring = bestString!.trimmingCharacters(in: .whitespaces)
    self.st = self.fullsTring
}

Here self.listeningTime is how long you want to wait after the last result before treating it as the end of the utterance.
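
The didFinishTalk selector isn't shown in the answer; here is a minimal sketch, assuming it should simply end the request and tear down the audio engine (the property names follow the question's code):

@objc func didFinishTalk() {
    // No new transcription arrived within listeningTime, so treat the utterance as finished.
    recognitionRequest?.endAudio()   // lets the recognizer deliver its final result
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)   // inputNode was optional on the original iOS 10 SDK
}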

Upvotes: 1

Zebra

Reputation: 113

Based on my test on iOS 10, when shouldReportPartialResults is set to false, you have to wait 60 seconds to get the result.
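
A possible workaround (my addition, not something this answer states): if you keep shouldReportPartialResults set to false, you can end the audio yourself instead of waiting for that timeout, and the single final result is then delivered to the result handler. A minimal sketch using the question's property names:

func stopListening() {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()   // the final transcription arrives in the result handler
}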

Upvotes: 4

Alan

Reputation: 1142

I am currently using speech-to-text in an app and it is working fine for me. My recognitionTask block is as follows:

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
    var isFinal = false

    if let result = result, result.isFinal {
        print("Result: \(result.bestTranscription.formattedString)")
        isFinal = result.isFinal
        completion(result.bestTranscription.formattedString, nil)
    }

    if error != nil || isFinal {
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)

        self.recognitionRequest = nil
        self.recognitionTask = nil
        completion(nil, error)
    }
})

Upvotes: 2

Joe Aspara

Reputation: 1197

It seems that the isFinal flag doesn't become true when the user stops talking, as you would expect. I guess this is intended behaviour on Apple's part, because "the user stopped talking" is an undefined event.

I believe that the easiest way to achieve your goal is to do the following:

  • You have to establish an "interval of silence". That means that if the user doesn't talk for a time greater than your interval (e.g. 2 seconds), they have stopped talking.

  • Create a Timer at the beginning of the audio session:

var timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)

  • when you get new transcriptions in recognitionTask, invalidate and restart your timer:

    timer.invalidate()
    timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)

  • if the timer expires, it means the user hasn't talked for 2 seconds. You can safely stop the audio session and exit (see the sketch below).
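
Putting the steps together, a minimal sketch of the whole pattern (Swift 3+ syntax; restartTimer and the exact teardown are my own assumptions, while audioEngine, recognitionRequest and transcriptLabel refer to the question's code):

var timer: Timer?

func restartTimer() {
    timer?.invalidate()
    timer = Timer.scheduledTimer(timeInterval: 2, target: self,
                                 selector: #selector(didFinishTalk),
                                 userInfo: nil, repeats: false)
}

// Call restartTimer() from the recognition result handler every time a new
// transcription arrives (e.g. right after updating transcriptLabel in the
// question's startRecording()).

@objc func didFinishTalk() {
    // 2 seconds passed with no new transcription: treat the utterance as finished.
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()
}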

Upvotes: 28
