IBM Cloud Speech-to-Text: Two Implementations


Three years ago I added IBM Cloud Speech-to-Text to my web app, LanguageTwo. It was the worst part of my app: the streaming WebSockets connection was unreliable at best. I just spent three weeks fixing it.

My project uses AngularJS and Firebase.

My first plan was to discontinue the WebSockets streaming and instead record each audiofile, save it to the database, send the file to the IBM Cloud for processing, and then send the response to the browser to display to the user.

I spent a week figuring out how to set up the browser to record from the microphone. It turns out that there’s an old way, using Navigator.getUserMedia(), and a new way using MediaDevices.getUserMedia(). There are tons of libraries that use the old way, badly. I had to dig through those before I found the new way. The new way is easy. First, in the template I created two buttons (Bootstrap) to start and stop the recording, and room for the results.

<div class="col-sm-2 col-md-2 col-lg-2" ng-show="nativeLanguage === 'en-US'">
  <button type="button" class="btn btn-block btn-default" ng-click="startWatsonSpeechToText()" uib-tooltip="Wait for 'Listening'">Start pronunciation</button>
</div>
<div class="col-sm-2 col-md-2 col-lg-2" ng-show="nativeLanguage === 'en-US'">
  <button type="button" class="btn btn-block btn-default" ng-click="stopWatsonSpeechToText()">Stop pronunciation</button>
</div>
<div class="col-sm-6 col-md-6 col-lg-6" ng-class="speechToTextResults">
  <h3>{{normalizedText}}&nbsp;&nbsp;&nbsp;{{confidence}}</h3>
</div>

Here’s the controller handler function:

$scope.getMicrophone = function() {
  navigator.mediaDevices.getUserMedia({ audio: true, video: false })
  .then(stream => {
    var options = {
      audioBitsPerSecond: 16000,
      mimeType: 'audio/webm;codecs=opus'
    };
    const mediaRecorder = new MediaRecorder(stream, options);
    mediaRecorder.start();

    const audioChunks = [];
    mediaRecorder.addEventListener("dataavailable", event => {
      audioChunks.push(event.data);
    });

    mediaRecorder.addEventListener("stop", () => {
      const audioBlob = new Blob(audioChunks);
      // upload to Firebase Storage
      // keywordArray and watsonSpeechModel are set elsewhere in the controller
      firebase.storage().ref('Users/' + $scope.user.uid + '/Pronunciation_Test').put(audioBlob)
      .then(function(snapshot) {
        firebase.storage().ref(snapshot.ref.location.path).getDownloadURL()
        .then(function(url) { // get downloadURL
          firebase.firestore().collection('Users').doc($scope.user.uid).collection("Pronunciation_Test").doc('downloadURL').set({
            downloadURL: url,
            keywords: keywordArray,
            model: watsonSpeechModel,
          })
          .then(function() {
            console.log("Document successfully written!");
          })
          .catch(function(error) {
            console.error("Error writing document: ", error);
          });
        })
        .catch(error => console.error(error));
      })
      .catch(error => console.error(error));

      // play back the audio blob
      const audioUrl = URL.createObjectURL(audioBlob);
      const audio = new Audio(audioUrl);
      audio.play();
    });

    // click to stop
    $scope.stopMicrophone = function() {
      $scope.normalizedText = "Waiting for IBM Cloud Speech-to-Text"; // displays this text
      $scope.confidence = "0.00"; // displays this confidence level
      mediaRecorder.stop();
    }; // end $scope.stopMicrophone
  })
  .catch(function(error) {
    console.log(error.name + ": " + error.message);
  });
};

The first line is the AngularJS handler function for the start button.

The first code block gets the microphone and creates a stream from it. This implementation requires mastering streaming. “Old school” code was algorithms and data structures; job interviews sometimes included solving a chess problem. Coding is now about time: handling streams of data and making chains of asynchronous API calls.

The next code block sets the `options` object. Its two properties cover three things: the bits per second, the media type, and the encoding format. Chrome has many video media types and encoding formats, but only a single audio media type (webm) and a single encoding format (opus). You get to choose the bits per second. Reading the IBM Cloud Speech-to-Text documentation, I saw that it has two families of models, Broadband and Narrowband. The Broadband models expect audio sampled at 16 kHz; the Narrowband models expect 8 kHz. There's no reason to record at higher quality than the service will keep, and 16,000 bits per second is plenty for speech.
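If you want to double-check what your browser will actually record before building the options object, MediaRecorder.isTypeSupported() takes a MIME type and returns true or false. A minimal sketch, assuming webm/opus is the preferred format (the empty-string fallback just lets the browser pick its own default):

// Check the preferred MIME type before constructing the MediaRecorder.
var preferredType = 'audio/webm;codecs=opus';
var mimeType = MediaRecorder.isTypeSupported(preferredType) ? preferredType : '';

var options = {
  audioBitsPerSecond: 16000, // speech doesn't need a higher bit rate than this
  mimeType: mimeType // '' means "let the browser choose"
};
console.log('Recording as: ' + (mimeType || 'browser default'));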

The next code block takes the stream and the options and starts the MediaRecorder. MediaRecorder is part of the MediaStream Recording API, the companion to getUserMedia().

Next, we make an array for audioChunks and push the streaming data into the array.

When the user clicks the stop button, the audioChunks are smushed together into a Blob.

The blob is uploaded to Firebase Storage, Firebase's store for large digital files such as photos, video, and audio. From the upload snapshot we then get a downloadURL.

The downloadURL is written to the Firebase Firestore database. This is a NoSQL database that handles documents and collections (objects and arrays). We also write a keywords array and the model, e.g., en-US_BroadbandModel.
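For reference, the document that the set() call above writes at Users/{uid}/Pronunciation_Test/downloadURL ends up looking roughly like this (the values are illustrative):

{
  downloadURL: "https://firebasestorage.googleapis.com/...", // where the audio blob lives
  keywords: ["bought", "caught"], // words for the service to listen for
  model: "en-US_BroadbandModel" // which Speech-to-Text model to use
}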

Then there’s a code block that plays back the recording to the user.

Finally, I have an AngularJS handler function for the stop button. Note that I’ve nested the two AngularJS handler functions, start and stop.

Now we have the audio file stored in the cloud database. I used a Firebase Cloud Function, also known as a Google Cloud Function, to send the audio file to IBM Cloud Speech-to-Text:

// assumes `functions` (firebase-functions) and `admin` (firebase-admin) are required and initialized at the top of index.js
exports.IBM_Speech_to_Text = functions.firestore.document('Users/{userID}/Pronunciation_Test/downloadURL').onUpdate((change, context) => {
  const axios = require('axios'); // library for HTTP requests
  const SpeechToTextV1 = require('ibm-watson/speech-to-text/v1');
  const { IamAuthenticator } = require('ibm-watson/auth');
  const speechToText = new SpeechToTextV1({
    authenticator: new IamAuthenticator({
      apikey: 's00pers3cret',
    }),
    url: 'https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/01010101',
  });
  const downloadURL = change.after.data().downloadURL.toString();
  const keywords = change.after.data().keywords;
  const model = change.after.data().model;
  const userID = context.params.userID;
  return axios({
    method: 'get',
    url: downloadURL,
    responseType: 'stream', // axios hands back a Node readable stream
  })
  .then(function (response) {
    var params = {
      audio: response.data, // the response body is already a stream, no need for fs.createReadStream
      contentType: 'application/octet-stream', // with 'application/octet-stream' the service detects the audio format automatically; 'audio/webm;codecs=opus' also works
      wordAlternativesThreshold: 0.9,
      maxAlternatives: 3,
      keywords: keywords, // passed in from the browser
      keywordsThreshold: 0.5,
      model: model, // passed in from the browser
      wordConfidence: true,
    };
    return speechToText.recognize(params)
    .then(results => {
      console.log(JSON.stringify(results, null, 2));
      console.log(results.result.results[0].alternatives[0].transcript);
      console.log(results.result.results[0].alternatives[0].confidence);
      return admin.firestore().collection('Users').doc(userID).collection('Pronunciation_Test').doc('downloadURL').update({
        response: {
          transcript: results.result.results[0].alternatives[0].transcript,
          confidence: results.result.results[0].alternatives[0].confidence
        } // one object so that a single listener can get both fields
      })
      .then(function() {
        console.log("Document updated.");
      })
      .catch(error => console.error(error));
    })
    .catch(error => console.error(error));
  })
  .catch(error => console.error(error));
});

The first line is the cloud function trigger. When a new downloadURL is written to the database location, the cloud function triggers.

Next, I pull in three libraries. axios is a modern Node package for handling HTTP requests. We also get the IBM Watson Speech-to-Text SDK and the IBM Cloud auth SDK.

I pass my API key and URL to the Speech-to-Text SDK. I got these from my IBM Cloud service dashboard.

I take four constants from the trigger: the downloadURL, keywords array, model, and userID.

Now I’m ready to send the HTTP request to Firebase Storage to get the audio file. Setting responseType: 'stream' makes axios hand back a Node readable stream (the same kind of object fs.createReadStream produces); in other words, the audio file is streamed. The reason we want to stream the data is that streaming uses less memory: if we read the audio file into memory, a 100K audio file would need 100K of memory, but by streaming it we only hold a small buffer at a time, maybe 5 or 10K.

Then I wait for the response from Firebase Storage. The first line of the params threw me for a couple days. It’s response.data, not response. The response is an object carrying the status, headers, and so on; all we want here is the data stream. This problem would’ve taken me five minutes to catch with Postman, but Postman doesn’t handle streams, so I couldn’t see what was coming back from Firebase Storage.
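If you ever need to sanity-check what axios hands back without Postman, response.data with responseType: 'stream' is an ordinary Node readable stream, so you can pipe it to a file and listen to it. A minimal sketch (the file path is just an example):

const fs = require('fs');
const axios = require('axios');

// Fetch the audio as a stream and write it to disk to inspect what Firebase Storage returns.
function dumpAudio(downloadURL) {
  return axios({ method: 'get', url: downloadURL, responseType: 'stream' })
  .then(function(response) {
    // response is the full axios response (status, headers, data, ...);
    // response.data is the readable stream carrying the audio bytes.
    response.data.pipe(fs.createWriteStream('/tmp/pronunciation-test.webm'));
  });
}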

The second line of the params sets the contentType to application/octet-stream. Wait, didn’t we already set this as webm with opus encoding? This is the media type that Firebase Storage uses if you don’t specify a media type. Mozilla says of application/octet-stream:

This is the default for binary files. As it means unknown binary file, browsers usually don’t execute it, or even ask if it should be executed.

In other words, the media type hasn’t changed; it’s still webm/opus. We just didn’t specify the media type when we uploaded to Firebase Storage.
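If you’d rather Firebase Storage keep the real media type, you can pass a metadata object as the second argument to put(). A small sketch of the earlier upload with an explicit contentType (optional here, since the service detects the format fine from application/octet-stream):

// Same upload as before, but telling Firebase Storage the actual media type.
var metadata = { contentType: 'audio/webm;codecs=opus' };
firebase.storage()
  .ref('Users/' + $scope.user.uid + '/Pronunciation_Test')
  .put(audioBlob, metadata)
  .then(function(snapshot) {
    console.log('Stored as: ' + snapshot.metadata.contentType);
  })
  .catch(function(error) {
    console.error(error);
  });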

Google Voice can’t handle the webm/opus media type. To send the file to Google Voice would require processing (recoding) the file using ffmpeg.

The rest of the params tell IBM Cloud Speech-to-Text what I want.

Now I call the Speech-to-Text SDK with speechToText.recognize(params). I wait for the results, log them, and write them to the Firestore database.

My AngularJS HomeController.js has a listener to listen for changes in that database location:

firebase.firestore().collection('Users').doc($scope.user.uid).collection("Pronunciation_Test").doc('downloadURL')
.onSnapshot(function(doc) {
  // console.log("Response: ", doc.data().response);
  $scope.normalizedText = doc.data().response.transcript; // displays this text
  $scope.confidence = doc.data().response.confidence; // displays this confidence level
  $scope.$apply();
});

And the user sees the response in the browser.

Works great! Except for one thing. The response time is one to two minutes, for a three-second recording. No user wants to wait that long. I archived the code and started over, using the streaming SDK.

You can see a demo of Transcribe from Microphone, with Word Confidence.

The first step is to install the watson-speech SDK Node module with npm install watson-speech. Install it in your project directory to make linking easier. Now link to the file in the index.html:

<script src="node_modules/watson-speech/dist/watson-speech.min.js"></script>

Update watson-speech at least every month. I update all my Node modules weekly.

watson-speech is both Speech-to-Text and Text-to-Speech. There might be a way to call only watson-speech/speech-to-text and save load time.

Next, you need a bearer token. You must make the token request from your server, not from the browser, because your request will include your IAM api-key.

(You can get a token from an old Cloud Foundry Speech-to-Text service with your username and password, but I found the Cloud Foundry billing to be a nightmare. Every few months I’d get an email from someone in Mexico telling me to call them with my credit card to pay a $0.38 invoice or they’d shut down my IBM Cloud account. They couldn’t show me the invoice or how to set up a credit card in Cloud Foundry. It looked like a phishing expedition, but it was how IBM Cloud Foundry handles billing. The new IAM service uses IBM Wallet, which makes it easy to see the invoices and put in your credit card for billing.)

My project is serverless. I use Firebase Cloud Functions, which is more or less Google Cloud Functions. I wrote a Node.js function to get a bearer token:

exports.IBM_IAM_Bearer_Token = functions.firestore.document('Users/{userID}/IBM_Token/Token_Request').onWrite((change, context) => {
  const axios = require('axios');
  const querystring = require('querystring');
  return axios.post('https://iam.cloud.ibm.com/identity/token', querystring.stringify({
    grant_type: 'urn:ibm:params:oauth:grant-type:apikey',
    apikey: 's00pers3cret'
  }))
  .then(function(response) {
    console.log(response.data.access_token);
    return admin.firestore().collection('Users').doc(context.params.userID).collection('IBM_Token').doc('Token_Value').set({ 'token': response.data })
    .then(function() {
      console.log('Token written to database.');
    })
    .catch(error => console.error(error));
  })
  .catch(error => console.error(error));
});

The first line is the trigger. My AngularJS controller starts with:

firebase.firestore().collection('Users').doc($scope.user.uid).collection('IBM_Token').doc('Token_Value').onSnapshot(function(doc) {
  $scope.token = doc.data().token;
  // $scope.token = doc.data().token.access_token;
});

firebase.firestore().collection('Users').doc($scope.user.uid).collection('IBM_Token').doc('Token_Request').set({ request: Math.random() });

$scope.normalizedText = undefined;
$scope.confidence = undefined;

When the page loads, the second code block writes a random number to a Firebase Firestore database location, which triggers the Firebase Cloud Function. The first code block is a listener that listens for a new token; the listener should start listening before a new token is requested. The last two lines clear the displayed text.

The user loads the page, a random number is written to the database, and the Cloud Function triggers. The next two lines load two modules for making the HTTP request: axios and querystring, a Node core module that encodes the form data for the request body.

Next, the Cloud Function makes the HTTP request, sending the API key to the IBM IAM token URL. The response comes back and response.data is written to the database.

IBM Cloud returns a JSON token object:

{
  access_token: "eyJraWQiOiIyMD...",
  expiration: 1585780896,
  expires_in: 3600,
  refresh_token: "OKDoBLmGnqDoLX4bVO...",
  scope: "ibm openid",
  token_type: "Bearer"
}

The access_token is what you send to IBM Cloud to start the Speech-to-Text service. expiration is the expiry time as a Unix timestamp, one hour out. The refresh_token enables refreshing the token when that hour is up.

My AngularJS listener puts the token object on the $scope.

I use the same template with the two buttons to start and stop the handler function.

The documentation for the IBM Cloud Speech-to-Text SDK is here.

Let’s get started on the handler function. The first three lines are:

$scope.startWatsonSpeechToText = function() {
  $scope.normalizedText = "Wait...";
  $scope.confidence = undefined;

This fires when the user clicks the Start button. The user is told to wait while the SDK connects to the IBM Cloud.

Next,

console.log(WatsonSpeech.version);

This logs the SDK version that is hooked up. I was updating the Node module diligently for three years, not realizing that my code linked to an old SDK installed with bower that was three years out of date. No wonder I thought IBM Cloud Speech-to-Text was flaky!

Next,

var expiry = new Date($scope.token.expiration * 1000);
var now = Date.now();
var duration = expiry - now; // milliseconds until the token expires

function msToTime(duration) {
  var seconds = Math.floor((duration / 1000) % 60),
      minutes = Math.floor((duration / (1000 * 60)) % 60);
  minutes = (minutes < 10) ? "0" + minutes : minutes;
  seconds = (seconds < 10) ? "0" + seconds : seconds;
  return "Token expires in " + minutes + ":" + seconds + " minutes:seconds";
}

console.log(msToTime(duration));

This logs the minutes and seconds left before your token expires. Trust me, one day you’ll be scratching your head as to why your auth failed, and it’s because it’s been more than an hour since you got a token.

Next, we set up the options:

const options = {
  accessToken: $scope.token.access_token,
  objectMode: true,
  format: false,
  wordConfidence: false,
  model: "en-US_BroadbandModel",
  interimResults: true,
};

If you’re using the old Cloud Foundry username/password auth to get your token, then the first property is token or access_token. With IAM auth it’s accessToken. That was one of those gotchas that took a couple days to figure out!

Note that we’re providing just the access_token, not the entire JSON token object.

The documentation for the option parameters is here. Remember that SDK version 37.0 and later uses camelCase, not under_score, property names.
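For example, a few of the renames (illustrative; check the SDK documentation for the full list):

// older under_score names          newer camelCase names
// interim_results: true      -->   interimResults: true
// word_confidence: true      -->   wordConfidence: true
// max_alternatives: 3        -->   maxAlternatives: 3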

I wasn’t sure at first what objectMode does: as far as I can tell it makes the stream emit parsed result objects instead of plain transcript text, and you don’t get the results handled below without it.

Now we’re ready to call WatsonSpeech.

var stream = WatsonSpeech.SpeechToText.recognizeMicrophone(options);

The IBM Cloud Speech-to-Text streaming SDK handles the WebSockets streaming for you.

Next, catch errors.

stream.on('error', function(error) {
  console.log(error);
  $scope.normalizedText = error;
  $scope.speechToTextResults = 'red';
  $scope.$apply();
});

Tell the user when the connection opens.

stream.on('open', function() {
  // emitted once the WebSocket connection has been established
  $scope.normalizedText = "Open";
  $scope.confidence = " ";
  $scope.$apply();
});

Tell the user to start speaking.

stream.on('listening', function() {
  // prompt the user to start talking
  $scope.normalizedText = "Listening";
  $scope.confidence = " ";
  $scope.$apply();
});

Log or display the data when it comes in.

stream.on('data', function(msg) {
  if (msg.results) {
    msg.results.forEach(function(result) {
      if (result.final) {
        $scope.normalizedText = result.alternatives[0].transcript;
        $scope.confidence = result.alternatives[0].confidence;
        // console.log(result.alternatives[0].word_confidence);
        $scope.$apply();
      } else {
        // interim results
        $scope.normalizedText = result.alternatives[0].transcript;
        // confidence doesn't come back until the final results
        $scope.$apply();
      }
    });
  }
});

Only the final results include the confidence measure. confidence is for the whole utterance; word_confidence shows which words Watson was unsure of. To display word_confidence, set wordConfidence: true in the options and see the code for Transcribe from Microphone, with Word Confidence.
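As a minimal sketch, assuming the service returns word_confidence as [word, score] pairs, you could pull out the shaky words inside the if (result.final) branch above:

// With wordConfidence: true in the options, each final alternative carries
// word_confidence pairs such as [ ["hello", 0.98], ["world", 0.62] ].
var shakyWords = (result.alternatives[0].word_confidence || [])
  .filter(function(pair) { return pair[1] < 0.8; }) // the 0.8 threshold is my choice, not the SDK's
  .map(function(pair) { return pair[0]; });
console.log('Watson was unsure about: ' + shakyWords.join(', '));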

Stop the session and close the handler function.

$scope.stopWatsonSpeechToText = function() {
  stream.stop();
};
}; // close $scope.startWatsonSpeechToText

The user clicks Start, starts talking, and words appear in a second or two, almost in real time. Streaming is much faster, a better user experience, and less code for you to write.


I make technology for speech clinics to treat stuttering and other disorders. I like backpacking with my dog, running competitively, and Russian jokes.
