Automatic speech recognition (ASR)

With SpeechKit, you can easily convert a user’s speech into text.

User experience

When performing speech recognition, it is important to indicate to the users that you are listening and that they are being heard.

Most speech recognition systems use the following guidelines to achieve a strong user experience.

Earcons

An earcon is a brief sound that acts as a signal to convey system information.

  1. Play an earcon when the system starts listening.
  2. Play an earcon when the system is done listening.
  3. Play an earcon when the transaction has been canceled.
  4. Play an earcon when an error has occurred.

Display

  1. Display visual feedback to indicate the volume of the user’s voice.
  2. Provide a button to allow the user to start a transaction.
  3. Provide a button to allow the user to cancel the current transaction.
  4. Display an appropriate error message if an error has occurred.
  • Refer to the page on Volume Info to learn how you can retrieve the volume of the user’s voice.
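
One way to keep the four earcon cues consistent is to map each listening event to a sound in a single place. The following is a minimal sketch, not part of SpeechKit: the ListeningEvent enum, the EarconTable class, and the sound names are hypothetical and stand in for whatever audio resources your application ships.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical events, one per earcon guideline above.
enum ListeningEvent { STARTED, FINISHED, CANCELED, ERROR }

// Hypothetical lookup table from event to sound resource name.
final class EarconTable {
    private final Map<ListeningEvent, String> sounds =
            new EnumMap<>(ListeningEvent.class);

    EarconTable() {
        // The resource names are placeholders chosen for this sketch.
        sounds.put(ListeningEvent.STARTED, "earcon_start");
        sounds.put(ListeningEvent.FINISHED, "earcon_stop");
        sounds.put(ListeningEvent.CANCELED, "earcon_cancel");
        sounds.put(ListeningEvent.ERROR, "earcon_error");
    }

    // Returns the sound name registered for the event, or null if none.
    String soundFor(ListeningEvent event) {
        return sounds.get(event);
    }

    // True when every event has a sound, so no state change is silent.
    boolean coversAllEvents() {
        return sounds.size() == ListeningEvent.values().length;
    }
}
```

Keeping the mapping in one table makes it easy to check that no event is left without audible feedback.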

Starting a transaction

To start an ASR transaction, simply create a Session and start a transaction with your desired recognition type, detection type, language, and result delivery type.

  • RecognitionType helps optimize your ASR results. Built-in types are Dictation, Search, and TV. Choose the one that matches your application: each type recognizes some words better at the expense of others.

  • DetectionType affects when the system decides that the end user is done speaking.

    • Setting this to Short is recommended for most use cases.
    • Setting this to Long allows your user to speak multiple sentences with short pauses in between.
    • Setting this to None disables end-of-speech detection and requires you to tell the transaction to stop recording. This usually requires the user to press and hold a button when speaking. Your user interface should make this clear.
  • ResultDeliveryType will determine when you get ASR results.

    • Setting this to FINAL gives you a single ASR result once the user is done speaking and the transaction has finished processing.
    • Setting this to PROGRESSIVE gives you several ASR results while the user is speaking. The last result you receive before onSuccess is called is the final result.

Session session = Session.Factory.session(this, Configuration.SERVER_URI, Configuration.APP_KEY);

Transaction.Options options = new Transaction.Options();
options.setRecognitionType(RecognitionType.DICTATION);
options.setDetection(DetectionType.Short);
options.setLanguage(Language.ENG_USA);
options.setResultDeliveryType(ResultDeliveryType.FINAL);

Transaction transaction = session.recognize(options, new Transaction.Listener() {
    public void onStartedRecording(Transaction transaction) { ... }
    public void onFinishedRecording(Transaction transaction) { ... }
    public void onRecognition(Transaction transaction, Recognition recognition) { ... }
    public void onSuccess(Transaction transaction, String s) { ... }
    public void onError(Transaction transaction, String s, TransactionException e) { ... }
});
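
With PROGRESSIVE delivery, the last result received before onSuccess is the final one, so the application has to remember the most recent result itself. The following is a minimal sketch of that bookkeeping. ProgressiveResultTracker is a hypothetical helper, not a SpeechKit class, and it works on plain strings so that it makes no assumptions about the SpeechKit types.

```java
// Hypothetical helper, not part of SpeechKit: remembers the most recent
// progressive result so it can be treated as final once the transaction succeeds.
final class ProgressiveResultTracker {
    private String latestText = null;
    private boolean finished = false;

    // Call from onRecognition with the text of each result as it arrives.
    void onPartialResult(String text) {
        if (finished) {
            throw new IllegalStateException("transaction already finished");
        }
        latestText = text;
    }

    // Call from onSuccess. Returns the last result seen, or null if none arrived.
    String onTransactionSucceeded() {
        finished = true;
        return latestText;
    }

    boolean isFinished() {
        return finished;
    }
}
```

In a listener, onRecognition would pass recognition.getText() to onPartialResult, and onSuccess would call onTransactionSucceeded to obtain the text to act on.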

Warning

Word Streaming (ResultDeliveryType.PROGRESSIVE) is not yet available to Silver or Gold customers. If you are an Emerald or managed customer and need this feature, please contact your Nuance representative to have it activated.

Canceling a transaction

If your users want to cancel a recognition in progress, call cancel on the transaction.

transaction.cancel();

Stop listening

If you’ve set the detection type to None, then you will need to tell the system to stop listening. Otherwise, the system will do this automatically for you.

transaction.stopRecording();
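
For a press-and-hold button, recording should stop exactly once when the button is released, even if the release event fires more than once. The following is a minimal sketch under that assumption. Stoppable and PushToTalkController are hypothetical names introduced here; Stoppable stands in for the single stopRecording call so the example does not depend on SpeechKit.

```java
// Hypothetical stand-in for the one call this sketch needs. In an application,
// the implementation would delegate to transaction.stopRecording().
interface Stoppable {
    void stopRecording();
}

// Hypothetical controller for a press-and-hold talk button.
final class PushToTalkController {
    private final Stoppable transaction;
    private boolean recording = true; // assumes recording began on button press

    PushToTalkController(Stoppable transaction) {
        this.transaction = transaction;
    }

    // Call when the user releases the talk button. Repeated calls are ignored.
    void onButtonReleased() {
        if (recording) {
            recording = false;
            transaction.stopRecording();
        }
    }

    boolean isRecording() {
        return recording;
    }
}
```

In an application, the controller could be constructed with a lambda such as () -> transaction.stopRecording().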

Interpreting the result

When ASR results are returned, the listener below is called. You get both the top result and an NBest list of all the results, ordered from most confident to least. Each result includes a confidence score that indicates how certain the system is that the result is a correct transcription (interpretation) of the input.

public void onRecognition(Transaction transaction, Recognition recognition) {
    // Take the best result
    String topRecognitionText = recognition.getText();

    // Or iterate through the NBest list
    List<RecognizedPhrase> nBest = recognition.getDetails();
    for (RecognizedPhrase phrase : nBest) {
        String text = phrase.getText();
        double confidence = phrase.getConfidence();
    }
}
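
A common use of the confidence score is to keep only the candidates that clear a threshold. The following is a minimal sketch of that idea. Candidate and NBestFilter are hypothetical types introduced for illustration, and the sample values and the threshold are assumptions, since the range of SpeechKit confidence scores is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type mirroring the text and confidence of one result.
final class Candidate {
    final String text;
    final double confidence;

    Candidate(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

// Hypothetical helper that filters an NBest list by confidence.
final class NBestFilter {
    // Returns the texts whose confidence is at least minConfidence,
    // preserving the order of the input list.
    static List<String> aboveThreshold(List<Candidate> nBest, double minConfidence) {
        List<String> accepted = new ArrayList<>();
        for (Candidate candidate : nBest) {
            if (candidate.confidence >= minConfidence) {
                accepted.add(candidate.text);
            }
        }
        return accepted;
    }
}
```

If no candidate clears the threshold, the returned list is empty, which is a reasonable point to ask the user to repeat the request.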