Speech recognition
You can recognize live speech during an active call or conference.
There are two modes of speech recognition in Voximplant. The Phrase hint mode recognizes user input from a list of preset phrases and is useful in IVRs and voice-interaction dialogs. The Freeform mode recognizes all user input and transcribes it to text, which is useful for transcribing a human conversation into a text file.
You can use various speech recognition engines, such as Google, Amazon, Microsoft, Yandex, or Tinkoff. You can find the profile names in the ASRProfileList reference.
The Phrase hint mode is supported by the Google profiles only. Google does not limit the results to the specified list, but the words from the list you specify have a higher chance of being selected.
Speech recognition usage
- Import the ASR module into your scenario via the require method:
require(Modules.ASR)
- Use the VoxEngine.createASR method to create an ASR object.
- Subscribe to the ASR object events like ASREvents.Result.
- Send media from a call object to the ASR object via the sendMediaTo method.
- Receive the results via the ASR events.
If you want to use the Phrase hint mode for speech recognition, specify the phraseHints parameter after the speech recognition profile. See the following code example to understand how it works:
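Here is a minimal sketch of the Phrase hint mode; the profile name and the hint phrases are placeholders, check ASRProfileList for the exact profile names:
```
require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // Create an ASR instance with a list of expected phrases
    const asr = VoxEngine.createASR({
      profile: ASRProfileList.Google.en_US, // placeholder profile
      phraseHints: ["yes", "no", "operator"] // placeholder phrases
    });
    // The Result event delivers the recognized phrase
    asr.addEventListener(ASREvents.Result, (ev) => {
      Logger.write("Recognized: " + ev.text + ", confidence: " + ev.confidence);
    });
    // Send the caller's audio to the recognizer
    call.sendMediaTo(asr);
  });
});
```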
If you want to use the Freeform mode, the Result event is triggered every time speech is recognized. There is always a delay between capture and recognition, so plan the user interaction accordingly.
Use the following code example to understand how this mode works:
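A sketch of the Freeform mode under the same assumptions; here the Result handler simply accumulates every recognized utterance:
```
require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // No phraseHints: recognize and transcribe all user input
    const asr = VoxEngine.createASR({
      profile: ASRProfileList.Google.en_US // placeholder profile
    });
    let transcript = "";
    // In the Freeform mode, Result triggers for every recognized utterance
    asr.addEventListener(ASREvents.Result, (ev) => {
      transcript += ev.text + "\n";
      Logger.write("Recognized: " + ev.text);
    });
    call.sendMediaTo(asr);
  });
  // Stop the scenario when the caller hangs up
  call.addEventListener(CallEvents.Disconnected, () => VoxEngine.terminate());
});
```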
Transcribing a call
Use the record method to transcribe a call or a conference into a text file. Set the transcribe parameter to true and the language parameter to one of the supported languages.
See the following code example to understand how it works:
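A minimal sketch; the language value here is an assumption, pick one of the languages the platform supports:
```
VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // Record the call and transcribe it into a text file
    call.record({
      transcribe: true,
      language: ASRLanguage.ENGLISH_US // assumption: one of the supported languages
    });
  });
});
```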
Unlike audio and video recording, transcription results are available only after a call ends, so retrieve them via the GetCallHistory method of the HTTP API. Call this method with the with_records=true parameter specified.
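A request might look like this (account_id and api_key are placeholders, and GetCallHistory accepts additional filtering parameters):
```
https://api.voximplant.com/platform_api/GetCallHistory?account_id=1&api_key=YOUR_API_KEY&with_records=true
```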
The response JSON contains records with the transcription_url field. Requesting this URL returns the transcription as plain text:
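An illustrative example of the transcription file contents:
```
Left: hello I would like to check my order status
Right: sure let me look it up for you
```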
Each line in the transcription file is prefixed with "Left" for the audio stream from a call endpoint to the Voximplant cloud, or with "Right" for the audio stream from the Voximplant cloud to a call endpoint (the same logic as the left and right audio channels in a stereo recording).
"Left" and "Right" names can be changed via the labels parameter. The dict parameter allows you to specify an array of words that the transcriber tries to match in case of recognition problems. Specifying domain-specific words can improve transcription results a lot.
Google's beta STT features usage
Since Google provides access to its Speech API v1p1beta1 features, we support them as well.
Currently, Voximplant speech recognition supports the following features:
- enableSeparateRecognitionPerChannel
- alternativeLanguageCodes
- enableWordTimeOffsets
- enableWordConfidence
- enableAutomaticPunctuation
- diarizationConfig
- metadata
To use the features, you have to set the beta parameter to true when creating an ASR instance:
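A sketch, assuming the beta features are passed alongside the profile when the instance is created:
```
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // placeholder profile
  beta: true, // enable the Speech API v1p1beta1 features
  enableAutomaticPunctuation: true, // assumption: beta features set as ASR parameters
  enableWordTimeOffsets: true
});
```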
Refer to the API reference to learn about the parameters.
Here is what the request's result looks like:
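An illustrative payload (the values are made up):
```
{
  "text": "check one two three",
  "confidence": 86,
  "resultEndTime": "3.180s",
  "channelTag": 1,
  "languageCode": "en-us"
}
```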
With Google beta features enabled, the ASREvents.Result event is extended with the resultEndTime, channelTag, and languageCode properties. You can see them in the session logs.
Passing parameters directly to the provider
There are two ways of passing speech recognition parameters to your provider. You can fill in the ASRParameters fields on the Voximplant side, as explained in this article, so the platform converts them to the provider's format and sends them to your provider. Alternatively, you can pass the parameters directly to the provider in the request parameter.
Specify the parameters in the exact format that your provider accepts; different providers use different formats. Refer to your provider's API reference to learn about them.
Here are examples of the request parameter for the most common providers:
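For instance, a request for Google might look like the following sketch; the inner object is assumed to follow Google's RecognitionConfig format, so consult Google's Speech-to-Text reference for the exact fields. Other providers follow the same pattern with their own request formats:
```
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // placeholder profile
  // Passed to the provider as-is, bypassing the ASRParameters conversion
  request: {
    config: {
      enableAutomaticPunctuation: true,
      model: "phone_call"
    }
  }
});
```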
Emotions and gender recognition
Some speech recognition providers support additional features, such as emotion and gender recognition. These features depend on the ASR provider and work differently from provider to provider. Refer to your provider's API reference to learn whether it supports a certain feature.
For example, Tinkoff ASR supports gender recognition, and SaluteSpeech ASR supports emotion recognition.
To receive this information, pass the corresponding settings to the ASR in the request parameter, as described in the Passing parameters directly to the provider section of this article.
Let us request emotion recognition from SaluteSpeech. Pass the emotions_result flag in the request parameter as shown below.
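A sketch; the profile name is a placeholder, and the exact position of emotions_result inside the request object should be verified against the SaluteSpeech API reference:
```
const asr = VoxEngine.createASR({
  profile: ASRProfileList.SaluteSpeech.ru_RU, // placeholder: check ASRProfileList
  request: {
    emotions_result: true // ask SaluteSpeech to return emotion scores
  }
});
```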
In return, SaluteSpeech provides the emotion recognition result in addition to the speech recognition result:
{
  // other parameters
  "response": {
    "emotionsResult": {
      "negative": 0.003373496,
      "neutral": 0.996082962,
      "positive": 0.000543531496
    }
  },
  "text": "check one two three"
}
Let us request gender recognition from Tinkoff in the same way. Pass the enable_gender_identification parameter in the request parameter:
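A sketch under the same caveats; whether the parameter nests under config is an assumption based on Tinkoff's STT request format:
```
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Tinkoff.ru_RU, // placeholder: check ASRProfileList
  request: {
    config: {
      enable_gender_identification: true // ask Tinkoff to return gender probabilities
    }
  }
});
```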
In return, Tinkoff provides gender identification probabilities in addition to the speech recognition result:
{
  // other parameters
  "response": {
    "results": [
      {
        "recognitionResult": {
          "genderIdentificationResult": {
            "femaleProba": 0.831116796,
            "maleProba": 0.168883175
          }
        }
      }
    ]
  },
  "text": "check one two three"
}