Google Speech To Text returns only 1 speaker although the conversation has 2 humans speaking to each other

Question

One of the features in my app records a conversation between two individuals (at most three) and then uses Google Speech To Text, version v1p1beta1, to obtain a diarized transcript of the speech in that recording.

The specifics: The audio recording is done on the client side using this code:

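navigator.mediaDevices.getUserMedia({ audio: true, video: false })
    .then(function (stream) {
        // Keep the audio-only stream for the MediaRecorder created later.
        userMediaStreamAOnly = stream;
    })
    .catch(function (err) {
        console.error('getUserMedia failed:', err);
    });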

...and...

vcAudioOnlyRecorder = new MediaRecorder(userMediaStreamAOnly, { mimeType: 'audio/webm' });

The recorded audio, whose media type is "audio/webm", ends up as a Base64-encoded string. So far so good. The following values are assigned to RecognitionConfig and RecognitionAudio:

joConfig = new JSONObject();
joConfig.put("encoding", AudioEncoding.ENCODING_UNSPECIFIED_VALUE);
joConfig.put("sampleRateHertz", 48000);
joConfig.put("languageCode", languageCode);
joConfig.put("enableWordTimeOffsets", true);
joConfig.put("enableAutomaticPunctuation", true);

jsonObjectDiarizationConfig = new JSONObject();
jsonObjectDiarizationConfig.put("enableSpeakerDiarization", Boolean.TRUE);
jsonObjectDiarizationConfig.put("minSpeakerCount", 2);
jsonObjectDiarizationConfig.put("maxSpeakerCount", 3);
joConfig.put("diarizationConfig", jsonObjectDiarizationConfig);

jsonObjectRecognitionMetadata = new JSONObject();
jsonObjectRecognitionMetadata.put("interactionType", InteractionType.DISCUSSION);
jsonObjectRecognitionMetadata.put("industryNaicsCodeOfAudio", 512290);
jsonObjectRecognitionMetadata.put("microphoneDistance", MicrophoneDistance.NEARFIELD_VALUE);
jsonObjectRecognitionMetadata.put("originalMediaType", OriginalMediaType.AUDIO);
jsonObjectRecognitionMetadata.put("recordingDeviceType", RecordingDeviceType.PC);
jsonObjectRecognitionMetadata.put("audioTopic", "Test");
joConfig.put("metadata", jsonObjectRecognitionMetadata);

joConfig.put("model", "default");
joConfig.put("useEnhanced", Boolean.TRUE);

joAudio = new JSONObject();
joAudio.put("content", Base64AudioContents);

joPayLoad = new JSONObject();
joPayLoad.put("config", joConfig);
joPayLoad.put("audio", joAudio);

I then invoke this URL:

https://speech.googleapis.com/v1/speech:longrunningrecognize?key=

to trigger the speech-to-text process.

I have the Base64 string, but I am not sure how to share it here on Stack Overflow.

In return, I get an Operation instance that (deep inside its data structure) contains the recognized words.

Now here is the problem: sometimes the words carry two speaker tags, 1 and 2, and each word is accurately attributed to the speaker who said it. At other times, all of the text (spoken by both speakers) is attributed to a single speaker, i.e., every word carries the same speaker tag, 1.
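
For reference, here is a minimal sketch of how the speaker tags could be tallied to tell the two cases apart. It assumes the operation's response has already been parsed into an object shaped like results[].alternatives[].words[], with a numeric speakerTag on each word, and that (as I understand the diarization documentation) the last result holds the full tagged word list. The function name countWordsBySpeaker and the sample data are hypothetical, not part of my app or of the API.

```javascript
// Count how many words were attributed to each speaker tag.
// Assumption: `response` looks like
//   { results: [ { alternatives: [ { words: [ { word, speakerTag } ] } ] } ] }
function countWordsBySpeaker(response) {
    const results = (response && response.results) || [];
    if (results.length === 0) {
        return {};
    }

    // Assumption: with diarization enabled, the last result carries
    // the complete list of words together with their speaker tags.
    const last = results[results.length - 1];
    const alternatives = last.alternatives || [];
    const words = (alternatives[0] && alternatives[0].words) || [];

    const counts = {};
    for (const word of words) {
        // Words without a tag are counted under 0.
        const tag = word.speakerTag === undefined ? 0 : word.speakerTag;
        counts[tag] = (counts[tag] || 0) + 1;
    }
    return counts;
}

// Hypothetical sample data for illustration only, not real API output.
const sampleResponse = {
    results: [
        { alternatives: [{ transcript: "hello there" }] },
        {
            alternatives: [{
                words: [
                    { word: "hello", speakerTag: 1 },
                    { word: "there", speakerTag: 1 },
                    { word: "hi", speakerTag: 2 }
                ]
            }]
        }
    ]
};

console.log(countWordsBySpeaker(sampleResponse));
```

If the returned object has a single key, the whole recording was attributed to one speaker, which is the failing case described above.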

Question 1: Is this an issue with the quality of my microphone? Note that I am using the microphone built into my laptop.

Question 2: Should I send lossless audio to the API instead? If so, I am not sure whether MediaRecorder supports a lossless format yet.

Question 3: Is this a technology whose current level of maturity is such that these anomalies are only to be expected?

Comment: I am perturbed by the consistently inconsistent speaker tags in the output. Any help will be highly appreciated.
