Speech synthesis (TTS)

SpeechKit makes it easy to perform text-to-speech conversion.

Starting a transaction

To start a TTS transaction, create a session and start a Transaction with your desired language and voice. If you don’t specify a voice, the default voice for the selected language is used.

Session session = Session.Factory.session(this, Configuration.SERVER_URI, Configuration.APP_KEY);

Transaction.Options options = new Transaction.Options();
options.setLanguage(Language.ENG_USA);
options.setVoice(Voice.SAMANTHA); // optional

String textToSpeak = "Hello World";

Transaction transaction = session.speakString(textToSpeak, options, new Transaction.Listener() {
    public void onAudio(Transaction transaction, Audio audio) { ... }
    public void onSuccess(Transaction transaction, String s) { ... }
    public void onError(Transaction transaction, String s, TransactionException e) { ... }
});

Playing the audio

After the server returns the Audio, it is played automatically.

To play the Audio yourself instead, turn off autoplay in the transaction options and pass the Audio to the session’s AudioPlayer when it arrives in onAudio.

options.setAutoplay(false);

...

public void onAudio(Transaction transaction, Audio audio) {
    session.getAudioPlayer().playAudio(audio);
}

For more information on the AudioPlayer, see the topic Audio Playback.

Pausing and resuming the audio

You can pause and resume the audio while it is playing.

To do this, use the session’s AudioPlayer:

...

session.getAudioPlayer().pause();
...

session.getAudioPlayer().play();

SSML

Speech Synthesis Markup Language (SSML) is a standardized markup language that gives you control over aspects of speech synthesis output such as rate, pitch, and volume. It also lets you insert pauses and control other aspects of how the text is read. To use SSML, format your request as SSML and call the speakMarkup method instead of the speakString method for your transaction request.
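
As a rough illustration of what "formatting your request in SSML" can look like, the sketch below builds a small SSML 1.0 document as a Java string. It uses only the <prosody> rate attribute and the <break> element, both described in this section as supported. The class name SsmlExample, the helper methods, and the "en-US" language tag are assumptions made for this example and are not part of SpeechKit. The resulting string is what you would hand to speakMarkup; that call is not shown because its exact signature is not given here.

```java
class SsmlExample {

    /** Escapes the characters that are special in XML text content. */
    public static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /**
     * Wraps plain text in a minimal SSML 1.0 document that sets the speaking
     * rate and inserts a pause after the text. All parameter names are
     * illustrative; rate takes SSML values such as "slow" or "fast".
     */
    public static String buildSsml(String text, String lang, String rate, int pauseMs) {
        return "<?xml version=\"1.0\"?>"
             + "<speak version=\"1.0\""
             + " xmlns=\"http://www.w3.org/2001/10/synthesis\""
             + " xml:lang=\"" + lang + "\">"
             + "<prosody rate=\"" + rate + "\">" + escapeXml(text) + "</prosody>"
             + "<break time=\"" + pauseMs + "ms\"/>"
             + "</speak>";
    }

    public static void main(String[] args) {
        // Build the markup; with SpeechKit available, this string would be
        // passed to speakMarkup in place of the plain text given to speakString.
        String markup = buildSsml("Tom & Jerry", "en-US", "slow", 500);
        System.out.println(markup);
    }
}
```

Escaping the text before embedding it keeps characters such as & and < from breaking the markup.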

Nuance Cloud Services supports SSML v1.0 (W3C Recommendation, 7 September 2004) with the following qualifications and exceptions:

  • <emphasis> Fully supported except for the ‘none’ level. Also, the system may choose to ignore this element in order to produce optimal natural speech output.
  • <voice> Fully supported except for the variant attribute. Also, the age attribute is only useful when using custom voices.
  • <prosody> Fully supported except for the duration, pitch, pitch-range, and contour values.
  • <break> Fully supported. However, setting the strength attribute to ‘none’ only has an audible effect when the TTS engine would have inserted a sentence break without an explicit <break> element.
  • <meta> Fully supported except for the http-equiv attribute.
  • <say-as> Not supported.
  • <lexicon> Not supported.
  • <audio> Not supported.
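
Because some elements are not supported, it can be useful to check markup before sending it. The following is a minimal sketch, not part of SpeechKit, that scans an SSML string for opening tags of the three elements listed above as not supported. The class name SsmlSupportCheck and the method findUnsupported are hypothetical, and the check is a simple pattern match rather than a full XML parse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SsmlSupportCheck {

    // Elements listed above as "Not supported".
    static final String[] UNSUPPORTED = {"say-as", "lexicon", "audio"};

    /** Returns the names of unsupported elements that occur in the markup. */
    public static List<String> findUnsupported(String ssml) {
        List<String> found = new ArrayList<>();
        for (String name : UNSUPPORTED) {
            // Match an opening tag: "<name" followed by whitespace, "/" or ">".
            Pattern tag = Pattern.compile("<" + Pattern.quote(name) + "[\\s/>]");
            if (tag.matcher(ssml).find()) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String markup = "<speak><say-as interpret-as=\"date\">1/2</say-as></speak>";
        // Print any unsupported elements so they can be removed before the request.
        System.out.println(findUnsupported(markup));
    }
}
```

A pattern match keeps the sketch short; markup that places these names inside comments or CDATA sections would need a real XML parser to handle correctly.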