Reputation: 19
One of the features in my app records a conversation between two individuals (three at most) and then uses the Google Speech-to-Text API, version v1p1beta1, to obtain a diarized transcript of that recording.
The specifics: The audio recording is done on the client side using this code:
navigator.mediaDevices.getUserMedia({ audio: true, video: false })
  .then(function (stream) {
    userMediaStreamAOnly = stream;
  });
...and...
vcAudioOnlyRecorder = new MediaRecorder(userMediaStreamAOnly, { mimeType: 'audio/webm' });
The recorded audio ends up as a Base64-encoded string whose media type is "audio/webm". So far so good.
Following are the values assigned to RecognitionConfig and RecognitionAudio:
joConfig = new JSONObject();
joConfig.put("encoding", AudioEncoding.ENCODING_UNSPECIFIED_VALUE);
joConfig.put("sampleRateHertz", 48000);
joConfig.put("languageCode", languageCode);
joConfig.put("enableWordTimeOffsets", true);
joConfig.put("enableAutomaticPunctuation", true);
jsonObjectDiarizationConfig = new JSONObject();
jsonObjectDiarizationConfig.put("enableSpeakerDiarization", Boolean.TRUE);
jsonObjectDiarizationConfig.put("minSpeakerCount", 2);
jsonObjectDiarizationConfig.put("maxSpeakerCount", 3);
joConfig.put("diarizationConfig", jsonObjectDiarizationConfig);
jsonObjectRecognitionMetadata = new JSONObject();
jsonObjectRecognitionMetadata.put("interactionType", InteractionType.DISCUSSION);
jsonObjectRecognitionMetadata.put("industryNaicsCodeOfAudio", 512290);
jsonObjectRecognitionMetadata.put("microphoneDistance", MicrophoneDistance.NEARFIELD_VALUE);
jsonObjectRecognitionMetadata.put("originalMediaType", OriginalMediaType.AUDIO);
jsonObjectRecognitionMetadata.put("recordingDeviceType", RecordingDeviceType.PC);
jsonObjectRecognitionMetadata.put("audioTopic", "Test");
joConfig.put("metadata", jsonObjectRecognitionMetadata);
joConfig.put("model", "default");
joConfig.put("useEnhanced", Boolean.TRUE);
joAudio = new JSONObject();
joAudio.put("content", Base64AudioContents);
joPayLoad = new JSONObject();
joPayLoad.put("config", joConfig);
joPayLoad.put("audio", joAudio);
I then invoke this URL:
https://speech.googleapis.com/v1/speech:longrunningrecognize?key=
to trigger the speech-to-text process.
I have the Base64 string; I am just not sure how to share it here on Stack Overflow.
In return, I get an Operation instance that (somewhere deep inside the data structure) contains instances of words.
Now here is the problem: Sometimes the instances of words contain two speaker tags, namely 1 and 2, and accurately attribute the words to the respective speakers.
And sometimes the text (corresponding to the words spoken by the two speakers) is attributed to just one speaker, i.e., the instances of words contain only one speaker tag, namely 1.
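To make the symptom concrete, the speaker tags in a response can be tallied. The following is only a minimal sketch, not code from the app: summarizeSpeakers is a hypothetical helper, and it assumes the finished operation's response has the shape results[].alternatives[].words[], with the complete tagged word list carried by the last result (my reading of how the diarization output is laid out).

```javascript
// Hypothetical helper: count how many words were attributed to each speaker tag.
// Assumption: `response` looks like
//   { results: [ { alternatives: [ { words: [ { word, speakerTag }, ... ] } ] } ] }
// and, with diarization enabled, the LAST result holds the full tagged word list.
function summarizeSpeakers(response) {
  const results = (response && response.results) || [];
  if (results.length === 0) return {};

  const last = results[results.length - 1];
  const alternative = (last.alternatives || [])[0] || {};
  const words = alternative.words || [];

  const counts = {};
  for (const w of words) {
    // Words without a tag are grouped under 0 so they stay visible.
    const tag = w.speakerTag === undefined ? 0 : w.speakerTag;
    counts[tag] = (counts[tag] || 0) + 1;
  }
  return counts;
}
```

A result with a single key (only tag 1) corresponds to the "everything attributed to one speaker" case described above.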
Question 1: Is this an issue with the quality of my microphone? Note that I am using the microphone that came built into my laptop.
Question 2: Should I pass lossless audio to the URL? In which case, I am not sure if MediaDevices supports a lossless format, yet.
Question 3: Is this a technology whose current level of maturity is such that these anomalies are only to be expected?
Comment: I am perturbed by the consistently inconsistent speaker tags in the output. Any help will be highly appreciated.
Upvotes: 0
Views: 142
Reputation: 98485
1: Is this an issue with the quality of my microphone? Note that I am using the microphone that came built into my laptop.
Record the audio from that microphone, play it back through decent headphones (nothing below $20 would qualify, unless it was a good secondhand deal), and hear for yourself :) It may well be a problem not with the microphone but with room reflections, background noise, etc.
2: Should I pass lossless audio to the URL? In which case, I am not sure if MediaDevices supports a lossless format, yet.
You need some control over the compression and bitrate of the audio coming from the microphone. Last time I played with video in Chrome, the defaults were terrible, and I'm not sure what they are for audio. YMMV. Either way, you must know exactly what the parameters are.
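As a rough sketch of what "knowing the parameters" could look like in the browser: the MediaRecorder constructor accepts audioBitsPerSecond alongside mimeType, and a track's getSettings() reports what the browser actually delivered. pickRecorderOptions and audioConstraints are hypothetical names, and the bitrate and constraint values are illustrative assumptions, not recommendations.

```javascript
// Hypothetical helper: build explicit MediaRecorder options instead of relying on defaults.
// `isTypeSupported` is injected so the function also runs outside a browser;
// in a page, pass (t) => MediaRecorder.isTypeSupported(t).
function pickRecorderOptions(isTypeSupported, bitsPerSecond) {
  const candidates = ['audio/webm;codecs=opus', 'audio/webm', 'audio/ogg;codecs=opus'];
  const mimeType = candidates.find((t) => isTypeSupported(t));
  const options = { audioBitsPerSecond: bitsPerSecond };
  if (mimeType) options.mimeType = mimeType; // otherwise let the browser choose
  return options;
}

// Constraints that state the channel count and processing stages explicitly.
// The values here are assumptions chosen for illustration.
const audioConstraints = {
  audio: {
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: false,
  },
  video: false,
};

// In a browser (not executed here):
// navigator.mediaDevices.getUserMedia(audioConstraints).then((stream) => {
//   const options = pickRecorderOptions((t) => MediaRecorder.isTypeSupported(t), 128000);
//   const recorder = new MediaRecorder(stream, options);
//   // Log what was actually granted (sample rate, channel count, ...).
//   console.log(stream.getAudioTracks()[0].getSettings());
// });
```

Browsers treat the requested bitrate and constraints as hints, so logging the granted settings is the part that tells you what is really being sent to the recognizer.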
3: Is this a technology whose current level of maturity is such that these anomalies are only to be expected?
You get what you pay for, too. If you want something with technical support and someone you can call who stands behind the product, you will do well to talk to a company whose sole focus is this technology, and not, for example, turning you into a product :)
The maturity level of the technology is IMHO adequate for most tasks, but who you get the technology from matters a whole lot.
Upvotes: 1