Reputation: 641
I am using the Azure Speech SDK to recognize an audio stream, based on speech_recognition_samples.cpp. From the class RecognitionResult I can only get the Text and m_duration, but how can I get the begin time and end time of a result within the speech? I know that e.Result->Offset()
can return the offset, but I am still confused by it. My code is:
void recognizeSpeech() {
    std::shared_ptr<SpeechConfig> config = SpeechConfig::FromSubscription("****", "****");
    config->RequestWordLevelTimestamps();

    auto pushStream = AudioInputStream::CreatePushStream();
    std::cout << "created push\n" << std::endl;
    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

    promise<void> recognitionEnd;

    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
             << "  Offset=" << e.Result->Offset() << std::endl
             << "  Duration=" << e.Result->Duration() << std::endl;
    });

    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                 << "  Offset=" << e.Result->Offset() << std::endl
                 << "  Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;

        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;

        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }
    });

    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

    WavFileReader reader(FILE_NAME);
    vector<uint8_t> buffer(1000);

    recognizer->StartContinuousRecognitionAsync().wait();

    int readSamples = 0;
    while ((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        pushStream->Write(buffer.data(), readSamples);
    }

    pushStream->Close();
    recognitionEnd.get_future().get();
    recognizer->StopContinuousRecognitionAsync().get();
}
The result is:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
Recognizing:my voice is my
Offset=6800000
Duration=9800000
Recognizing:my voice is my passport
Offset=6800000
Duration=14400000
Recognizing:my voice is my passport verify me
Offset=6800000
Duration=26100000
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
CANCELED: Reach the end of the file.
Why is the offset of the result always 6800000? I expected it to increase continuously, for example: the begin offset of "my" is 0 and its end offset is 100000; the begin offset of "my voice is" is 0 and its end offset is 200000. Then I could get the begin time and end time of "my voice is" within the sentence. How can I get the begin time and end time of each result within the sentence?
Upvotes: 0
Views: 515
Reputation: 4174
If you look closely at your output, there are two events:
Recognizing and Recognized.
Recognizing: signals that an intermediate recognition result has been received.
Recognized: signals that a final recognition result has been received.
So the offset that you see is for the complete sentence (the Recognized event, which usually ends at the first pause): "My voice is my passport, verify me." That is why the offset is the same for all the Recognizing (intermediate) events. If the audio contained another sentence, you would see an additional Recognized event, and its offset would grow the way you are expecting.
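So for each final (Recognized) result, the begin time is Offset() and the end time is Offset() + Duration(), both in the SDK's 100-nanosecond ticks (see Update 2 below). A minimal sketch of what you could add inside your Recognized handler (the variable names are mine):

// Inside the Recognized handler, for a final result:
if (e.Result->Reason == ResultReason::RecognizedSpeech)
{
    uint64_t beginTicks = e.Result->Offset();                        // start of the sentence in the audio
    uint64_t endTicks   = e.Result->Offset() + e.Result->Duration(); // end of the sentence in the audio
    cout << "Sentence spans ticks [" << beginTicks << ", " << endTicks << "]" << std::endl;
}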
Update:
Additional note: The duration grows from zero for every recognized event; across the intermediate results it counts up from zero to the duration of the complete recognized result.
For instance:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
So if you want an approximate time span for the words added in the intermediate result "my voice is" (that is, "voice is"), you could add the initial offset and the duration of the previous intermediate result: 6800000 + 2700000 gives the begin time, and the end time would be 6800000 + 8500000 (the offset plus the current duration).
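A rough sketch of that idea, assuming you keep the previous intermediate duration in a variable captured by the Recognizing handler (prevDuration is my name, not part of the SDK):

uint64_t prevDuration = 0; // duration of the previous intermediate result, in 100-ns ticks

recognizer->Recognizing.Connect([&prevDuration](const SpeechRecognitionEventArgs& e)
{
    uint64_t begin = e.Result->Offset() + prevDuration;          // e.g. 6800000 + 2700000
    uint64_t end   = e.Result->Offset() + e.Result->Duration();  // e.g. 6800000 + 8500000
    cout << "New words span ticks [" << begin << ", " << end << "]" << std::endl;
    prevDuration = e.Result->Duration(); // remember for the next intermediate result
});
// Reset prevDuration to 0 in the Recognized handler, since the duration restarts for the next sentence.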
Update 2
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
They are in units of 100 nanoseconds (10^-7 seconds).
So let us take your case:
your offset is 6800000, which is 0.68 seconds.
That means the sentence (the complete recognized event for that audio stream) starts at the 0.68-second mark of the entire audio.
The duration, i.e. the total time taken to utter "My voice is my passport, verify me.", is 2.81 seconds (28100000).
The offset of the 2nd sentence (a second recognized event) would be greater than this offset plus duration.
The duration can be either less than or greater than the offset:
starting in the 3rd second of the entire audio, I can utter a 4-second-long stretch of speech without a pause.
The offset would then be 3 seconds and the duration 4 seconds.
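Since both numbers are counts of 100-nanosecond ticks, converting them to seconds is just dividing by 10^7. A small sketch using std::chrono (the SdkTicks alias and ticksToSeconds helper are mine):

#include <chrono>

// One SDK tick is 100 ns, i.e. 1/10,000,000 of a second.
using SdkTicks = std::chrono::duration<uint64_t, std::ratio<1, 10000000>>;

double ticksToSeconds(uint64_t ticks)
{
    return std::chrono::duration<double>(SdkTicks(ticks)).count();
}

// ticksToSeconds(6800000)  == 0.68  -> the sentence starts 0.68 s into the audio
// ticksToSeconds(28100000) == 2.81  -> it takes 2.81 s to utter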
Upvotes: 1