Reputation: 641
I am using the Azure Speech SDK to recognize an audio stream, based on speech_recognition_samples.cpp. From the class RecognitionResult I can only get the Text and m_duration, but how can I get the begin time and end time of a result within the speech? I know that e.Result->Offset()
can return the offset, but I am still confused by it. My code is:
void recognizeSpeech() {
    std::shared_ptr<SpeechConfig> config = SpeechConfig::FromSubscription("****", "****");
    config->RequestWordLevelTimestamps();

    auto pushStream = AudioInputStream::CreatePushStream();
    std::cout << "created push\n" << std::endl;
    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

    promise<void> recognitionEnd;

    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
             << "  Offset=" << e.Result->Offset() << std::endl
             << "  Duration=" << e.Result->Duration() << std::endl;
    });

    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                 << "  Offset=" << e.Result->Offset() << std::endl
                 << "  Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;

        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;

        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }
    });

    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

    WavFileReader reader(FILE_NAME);
    vector<uint8_t> buffer(1000);

    recognizer->StartContinuousRecognitionAsync().wait();

    int readSamples = 0;
    while ((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        pushStream->Write(buffer.data(), readSamples);
    }

    pushStream->Close();
    recognitionEnd.get_future().get();
    recognizer->StopContinuousRecognitionAsync().get();
}
The result is:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
Recognizing:my voice is my
Offset=6800000
Duration=9800000
Recognizing:my voice is my passport
Offset=6800000
Duration=14400000
Recognizing:my voice is my passport verify me
Offset=6800000
Duration=26100000
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
CANCELED: Reach the end of the file.
Why is the offset of the result always 6800000? I expected it to increase continuously, for example: the begin offset of "my" is 0 and its end offset is 100000; the begin offset of "my voice is" is 0 and its end offset is 200000. Then I could get the begin time and end time of "my voice is" within the sentence. How can I get the begin time and end time of each result within the sentence?
Upvotes: 0
Views: 515
Reputation: 4174
If you look closely at your output, there are two events:
Recognizing and Recognized.
Recognizing: signals that an intermediate recognition result has been received.
Recognized: signals that a final recognition result has been received.
So the offset that you see is for the complete sentence (the Recognized event, which usually ends at the first pause): "My voice is my passport, verify me." That is why the offset is the same for all the Recognizing (intermediate) events. If the audio contained another sentence, you would see an additional Recognized event, and its offset would grow the way you are expecting.
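So for each final (Recognized) result, the begin time is Offset() and the end time is Offset() + Duration(), both in the SDK's 100-nanosecond ticks (see Update 2 below). A minimal sketch of what you could add inside your Recognized handler (the variable names are mine):

// Inside the Recognized handler, for a final result:
if (e.Result->Reason == ResultReason::RecognizedSpeech)
{
    uint64_t beginTicks = e.Result->Offset();                        // start of the sentence in the audio
    uint64_t endTicks   = e.Result->Offset() + e.Result->Duration(); // end of the sentence in the audio
    cout << "Sentence spans ticks [" << beginTicks << ", " << endTicks << "]" << std::endl;
}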
Update:
Additional note: The duration grows from zero for every recognized event; across the intermediate results it counts up from zero to the duration of the complete recognized result.
For instance:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
So if you want an approximate time span for the words added in the intermediate result "my voice is" (that is, "voice is"), you could add the initial offset and the duration of the previous intermediate result: 6800000 + 2700000 gives the begin time, and the end time would be 6800000 + 8500000 (the offset plus the current duration).
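A rough sketch of that idea, assuming you keep the previous intermediate duration in a variable captured by the Recognizing handler (prevDuration is my name, not part of the SDK):

uint64_t prevDuration = 0; // duration of the previous intermediate result, in 100-ns ticks

recognizer->Recognizing.Connect([&prevDuration](const SpeechRecognitionEventArgs& e)
{
    uint64_t begin = e.Result->Offset() + prevDuration;          // e.g. 6800000 + 2700000
    uint64_t end   = e.Result->Offset() + e.Result->Duration();  // e.g. 6800000 + 8500000
    cout << "New words span ticks [" << begin << ", " << end << "]" << std::endl;
    prevDuration = e.Result->Duration(); // remember for the next intermediate result
});
// Reset prevDuration to 0 in the Recognized handler, since the duration restarts for the next sentence.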
Update 2
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
They are in units of 100 nanoseconds (10^-7 seconds).
So let us take your case:
your offset is 6800000, which is 0.68 seconds.
That means the sentence (the complete recognized event for that audio stream) starts at the 0.68-second mark of the entire audio.
The duration, i.e. the total time taken to utter "My voice is my passport, verify me.", is 2.81 seconds (28100000).
The offset of the 2nd sentence (a second recognized event) would be greater than this offset plus duration.
The duration can be either less than or greater than the offset:
starting in the 3rd second of the entire audio, I can utter a 4-second-long stretch of speech without a pause.
The offset would then be 3 seconds and the duration 4 seconds.
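Since both numbers are counts of 100-nanosecond ticks, converting them to seconds is just dividing by 10^7. A small sketch using std::chrono (the SdkTicks alias and ticksToSeconds helper are mine):

#include <chrono>

// One SDK tick is 100 ns, i.e. 1/10,000,000 of a second.
using SdkTicks = std::chrono::duration<uint64_t, std::ratio<1, 10000000>>;

double ticksToSeconds(uint64_t ticks)
{
    return std::chrono::duration<double>(SdkTicks(ticks)).count();
}

// ticksToSeconds(6800000)  == 0.68  -> the sentence starts 0.68 s into the audio
// ticksToSeconds(28100000) == 2.81  -> it takes 2.81 s to utter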
Upvotes: 1