Reputation: 12708
I am building an Alexa Skill using AWS Lambda and NodeJS. I have two questions:
1) Is it possible for me to retrieve the full transcript of the speaker?
In my Alexa phone app, I'm able to read exactly what I've spoken, but I'd like to collect this data so I can possibly analyze how people are speaking to my Skill.
This is possible with speech-to-text tools like the Google Speech APIs (demo here, spec here), using handlers such as recognition.onresult:
recognition.onresult = function(event) {
  var final_transcript = '';
  for (var i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      final_transcript += event.results[i][0].transcript;
    }
  }
};
In my Alexa app, you can see that it captured my request when I asked it to "sing happy birthday":
How can I programmatically capture this? I'd like to know when a user asks for things that I haven't thought of, collect these failures and common speech requests, and improve the skill based on it.
2) Does Alexa support multiple voices and multiple languages (input and output)?
Again, looking at the Google Speech APIs, you can see they allow many modifications to speech input and output, including multiple languages and even speech rate:
var utterance = new SpeechSynthesisUtterance();
utterance.rate = 0.7;     // slow down speech output
utterance.lang = "zh-CN"; // Mandarin Chinese
speechSynthesis.speak(utterance);
Does Alexa offer this suite of controls?
Upvotes: 6
Views: 1846
Reputation: 1014
Use this hack created by my colleague Bryan Colligan.
The hack uses a custom slot type CONTENT_LIST with a single value "all" to capture any word. By creating sample utterances made up of multiple catch-all slots, for example "{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX}", you can capture sentences of varying length with relative ease.
Note: In my experience, Amazon's AMAZON.SearchQuery slot is limited to 5-6 words.
Warning: Amazon's transcriptions are pretty bad, so don't be surprised if what you capture is somewhat unreadable. This shortcoming is likely one reason Amazon does not expose its transcripts; Google is much further ahead in voice-to-text. I'm sure Amazon will release transcripts in the future once it is more comfortable with its technology.
The following code, which can be placed in your Lambda function, concatenates the word slots into a single sentence.
let querySentence = '';
// Slot names must match the interaction model below exactly
// (including the misspelled Roman numerals WordIXX and WordIXXX).
const wordSlots = ["WordI", "WordII", "WordIII", "WordIV", "WordV", "WordVI", "WordVII", "WordVIII", "WordIX", "WordX", "WordXI", "WordXII", "WordXIII", "WordXIV", "WordXV", "WordXVI", "WordXVII", "WordXVIII", "WordIXX", "WordXX", "WordXXI", "WordXXII", "WordXXIII", "WordXXIV", "WordXXV", "WordXXVI", "WordXXVII", "WordXXVIII", "WordIXXX", "WordXXX"];
wordSlots.forEach((word) => {
  const slot = this.event.request.intent.slots[word];
  // Skip slots that were not filled ('?' is the placeholder for unrecognized words).
  if (slot && slot.value && slot.value !== '?') {
    querySentence = querySentence + ' ' + slot.value;
  }
});
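To see what this concatenation produces, here is a self-contained sketch of the same logic (the `buildQuerySentence` helper and the sample slot values are mine; `slots` mirrors the shape of `this.event.request.intent.slots`):

```javascript
// Concatenate the values of the catch-all word slots into one sentence.
function buildQuerySentence(slots, slotNames) {
  return slotNames
    .map((name) => slots[name])
    // Drop unfilled slots and the '?' placeholder for unrecognized words.
    .filter((slot) => slot && slot.value && slot.value !== '?')
    .map((slot) => slot.value)
    .join(' ');
}

// Example intent-request fragment with three filled slots:
const slots = {
  WordI: { name: 'WordI', value: 'sing' },
  WordII: { name: 'WordII', value: 'happy' },
  WordIII: { name: 'WordIII', value: 'birthday' },
  WordIV: { name: 'WordIV' } // unfilled slots arrive with no value
};
const sentence = buildQuerySentence(slots, ['WordI', 'WordII', 'WordIII', 'WordIV']);
console.log(sentence); // "sing happy birthday"
```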
The following interaction model uses the custom slot type CONTENT_LIST with the single value "all" to capture any word.
{
"interactionModel": {
"languageModel": {
"invocationName": "alpha voice",
"intents": [
{
"name": "AMAZON.CancelIntent",
"samples": [
"cancel"
]
},
{
"name": "AMAZON.HelpIntent",
"samples": [
"help"
]
},
{
"name": "AMAZON.StopIntent",
"samples": [
"stop"
]
},
{
"name": "OzIntent",
"slots": [
{
"name": "Query",
"type": "AMAZONSearchQuery"
},
{
"name": "WordI",
"type": "CONTENT_LIST"
},
{
"name": "WordII",
"type": "CONTENT_LIST"
},
{
"name": "WordIII",
"type": "CONTENT_LIST"
},
{
"name": "WordIV",
"type": "CONTENT_LIST"
},
{
"name": "WordV",
"type": "CONTENT_LIST"
},
{
"name": "WordVI",
"type": "CONTENT_LIST"
},
{
"name": "WordVII",
"type": "CONTENT_LIST"
},
{
"name": "WordVIII",
"type": "CONTENT_LIST"
},
{
"name": "WordIX",
"type": "CONTENT_LIST"
},
{
"name": "WordX",
"type": "CONTENT_LIST"
},
{
"name": "WordXI",
"type": "CONTENT_LIST"
},
{
"name": "WordXII",
"type": "CONTENT_LIST"
},
{
"name": "WordXIII",
"type": "CONTENT_LIST"
},
{
"name": "WordXIV",
"type": "CONTENT_LIST"
},
{
"name": "WordXV",
"type": "CONTENT_LIST"
},
{
"name": "WordXVI",
"type": "CONTENT_LIST"
},
{
"name": "WordXVII",
"type": "CONTENT_LIST"
},
{
"name": "WordXVIII",
"type": "CONTENT_LIST"
},
{
"name": "WordIXX",
"type": "CONTENT_LIST"
},
{
"name": "WordXX",
"type": "CONTENT_LIST"
},
{
"name": "WordXXI",
"type": "CONTENT_LIST"
},
{
"name": "WordXXII",
"type": "CONTENT_LIST"
},
{
"name": "WordXXIII",
"type": "CONTENT_LIST"
},
{
"name": "WordXXIV",
"type": "CONTENT_LIST"
},
{
"name": "WordXXV",
"type": "CONTENT_LIST"
},
{
"name": "WordXXVI",
"type": "CONTENT_LIST"
},
{
"name": "WordXXVII",
"type": "CONTENT_LIST"
},
{
"name": "WordXXVIII",
"type": "CONTENT_LIST"
},
{
"name": "WordIXXX",
"type": "CONTENT_LIST"
},
{
"name": "WordXXX",
"type": "CONTENT_LIST"
}
],
"samples": [
"{WordI}",
"{WordI} {WordII}",
"{WordI} {WordII} {WordIII}",
"{WordI} {WordII} {WordIII} {WordIV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV} {WordXXVI}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV} {WordXXVI} {WordXXVII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV} {WordXXVI} {WordXXVII} {WordXXVIII}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV} {WordXXVI} {WordXXVII} {WordXXVIII} {WordIXXX}",
"{WordI} {WordII} {WordIII} {WordIV} {WordV} {WordVI} {WordVII} {WordVIII} {WordIX} {WordX} {WordXI} {WordXII} {WordXIII} {WordXIV} {WordXV} {WordXVI} {WordXVII} {WordXVIII} {WordIXX} {WordXX} {WordXXI} {WordXXII} {WordXXIII} {WordXXIV} {WordXXV} {WordXXVI} {WordXXVII} {WordXXVIII} {WordIXXX} {WordXXX}"
]
},
{
"name": "AMAZON.NavigateHomeIntent",
"samples": [
"navigate home"
]
}
],
"types": [
{
"name": "AMAZONSearchQuery",
"values": [
{
"name": {
"value": "all"
}
}
]
},
{
"name": "CONTENT_LIST",
"values": [
{
"name": {
"value": "all"
}
}
]
}
]
}
}
}
Note: I use this code as a catch-all for my skill; it's the only intent. If you want other intents, so that this intent only catches utterances that fall through, I'd recommend experimenting: create an intent with defined utterances and see whether Amazon picks it before falling back on this free-form capture.
Please comment below if you have success and I'll update the answer.
Upvotes: 4
Reputation: 196
An updated answer:
Q1: It's still not possible to get the audio, but you can use a built-in slot type like AMAZON.SearchQuery to capture values you haven't specified.
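For illustration, a minimal intent using AMAZON.SearchQuery could look like this in the interaction model (the intent name and carrier phrases are my own; note that AMAZON.SearchQuery sample utterances must include a carrier phrase in addition to the slot):

```json
{
  "name": "CatchAllIntent",
  "slots": [
    {
      "name": "Query",
      "type": "AMAZON.SearchQuery"
    }
  ],
  "samples": [
    "search for {Query}",
    "ask about {Query}"
  ]
}
```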
Q2: Now you can use different voices in your skill by using the voice
tag in SSML like this:
<voice name="Kendra"><lang xml:lang="en-US">I want to tell you a secret.</lang></voice>
<voice name="Brian"><lang xml:lang="en-GB">Your secret is safe with me!</lang></voice>
The following voices are supported for their respective languages:
English, American (en-US): Ivy, Joanna, Joey, Justin, Kendra, Kimberly, Matthew, Salli
English, Australian (en-AU): Nicole, Russell
English, British (en-GB): Amy, Brian, Emma
English, Indian (en-IN): Aditi, Raveena
German (de-DE): Hans, Marlene, Vicki
Spanish, Castilian (es-ES): Conchita, Enrique
Italian (it-IT): Carla, Giorgio
Japanese (ja-JP): Mizuki, Takumi
French (fr-FR): Celine, Lea, Mathieu
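To tie this back to a Lambda skill, the SSML above could be returned in the standard skill response envelope (a minimal sketch; the greeting text is taken from the example above):

```javascript
// Build an SSML string that switches between two voices mid-response.
const ssml =
  '<speak>' +
  '<voice name="Kendra"><lang xml:lang="en-US">I want to tell you a secret.</lang></voice>' +
  '<voice name="Brian"><lang xml:lang="en-GB">Your secret is safe with me!</lang></voice>' +
  '</speak>';

// Standard Alexa skill response carrying SSML output speech.
const response = {
  version: '1.0',
  response: {
    outputSpeech: { type: 'SSML', ssml: ssml },
    shouldEndSession: true
  }
};
console.log(JSON.stringify(response));
```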
Upvotes: 0
Reputation: 499
Question 1:
Not currently. According to the request syntax, the audio clip is not provided to your service's endpoint. Alternatively, if you were providing the hardware and leveraging the Alexa Voice Service, then you would be capturing the audio yourself.
Question 2:
Not currently. Alexa seems to support only English.
Upvotes: 3