Some fun with “Watson Text to Speech” and voice model customization

My last blog post was about Watson Speech to Text language model customization, and this blog post is about configuring a custom voice model for IBM Cloud Watson Text to Speech (TTS). Now it’s time to have some fun with the Watson TTS service: I created a fun customization of the service so that the German pronunciation sounds a little bit like the Palatinate dialect.

Here are two WAV files I created with a custom Watson Text to Speech voice model that show the difference:

  • German standard
  • German that sounds a bit like the Palatinate dialect

You can get the code, and a quick technical overview of how to customize a voice model using cURL inside a bash script, in the GitHub project watson-tts-invocation.
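
If you prefer to see the REST calls outside of bash, here is a minimal Python sketch (not the project's code). It assumes the script authenticates the usual IBM Cloud way, with cURL's `-u "apikey:$APIKEY"`, which is HTTP basic auth with the fixed user name `apikey`. `APIKEY`, `SERVICE_URL` (an eu-de instance URL pattern) and the helper names are hypothetical placeholders.

```python
import base64
import urllib.request

# Hypothetical placeholders: use the API key and URL from your own
# IBM Cloud service credentials.
APIKEY = "your-api-key"
SERVICE_URL = "https://api.eu-de.text-to-speech.watson.cloud.ibm.com/instances/your-instance-id"


def basic_auth_header(apikey: str) -> str:
    """Return the header value that cURL's -u "apikey:<key>" would send."""
    token = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(path, method="GET", data=None, headers=None):
    """Build (but do not send) an authenticated request to the TTS service."""
    all_headers = {"Authorization": basic_auth_header(APIKEY)}
    all_headers.update(headers or {})
    return urllib.request.Request(SERVICE_URL + path, data=data,
                                  headers=all_headers, method=method)


# Example: the request that lists the available voices (GET /v1/voices).
voices_request = build_request("/v1/voices")
```

Sending it is one more line, `urllib.request.urlopen(voices_request)`, once the placeholders hold real credentials.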

This is an extract from the IBM Cloud Catalog:

“The Text to Speech service converts written text to natural-sounding speech. The service streams the synthesized audio back with minimal delay. The audio uses appropriate cadence and intonation for its language and dialect to provide voices that are smooth and natural. The service can be used in applications such as voice-automated chatbots, as well as a variety of voice-driven and screenless applications, such as tools for the disabled or visually impaired, video narration and voice over, and educational and home-automation solutions.”

IBM Cloud documentation

Simplified architecture dependencies

Let’s have a short look at the simplified architecture dependencies in the following image.

The TTS service lets you customize a language voice model by creating a custom voice model that contains custom words.

A custom voice model extends the base model, for example with a new accent or with the jargon of a region. Custom words extend the custom voice model: they cover written text that contains unknown words, or they map known words to a different pronunciation. This works because “IBM Watson™ Text to Speech applies language-dependent pronunciation rules to synthesize text. The service applies the rules to convert the ordinary (orthographic) spelling of each word to a phonetic spelling. A word’s phonetic spelling uses phoneme symbols to define how the word is pronounced. These symbols are the distinct units of sound that distinguish words in a language, the boundaries between syllables, and the stress marks for the syllables.” (partly from the IBM Cloud documentation)
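
To look at the phonetic spelling the service actually uses, the API offers a pronunciation query (`GET /v1/pronunciation` with `text`, `voice`, `format` and an optional `customization_id`). A custom word can also carry an SSML phoneme instead of a sounds-like spelling. The following is a hedged sketch that only builds the URL and a word entry. The voice `de-DE_BirgitV3Voice`, the service URL and the IPA string are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

# Hypothetical placeholders; replace with your own values.
SERVICE_URL = "https://api.eu-de.text-to-speech.watson.cloud.ibm.com/instances/your-instance-id"
VOICE = "de-DE_BirgitV3Voice"  # assumption: any German voice works here


def pronunciation_url(word, customization_id=None, fmt="ipa"):
    """URL for GET /v1/pronunciation, which returns the phonetic spelling
    the service uses for a word, optionally with a custom model applied."""
    params = {"text": word, "voice": VOICE, "format": fmt}
    if customization_id:
        params["customization_id"] = customization_id
    return f"{SERVICE_URL}/v1/pronunciation?{urlencode(params)}"


def phoneme_entry(word, ipa):
    """A custom-word entry whose translation is an SSML phoneme
    instead of a sounds-like spelling such as "isch"."""
    return {"word": word,
            "translation": f'<phoneme alphabet="ipa" ph="{ipa}"></phoneme>'}


print(pronunciation_url("ich"))
print(json.dumps(phoneme_entry("ich", "ɪʃ"), ensure_ascii=False))
```

Comparing the pronunciation of a word with and without `customization_id` is a quick way to check whether a custom word took effect.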

The service has a new beta feature called “Tune by Example”. It lets you control exactly how the service speaks a specified text: you can dictate the intonation, stress, tempo, cadence, rhythm, and pauses of the synthesized text. These aspects of speech are collectively referred to as prosody.

You can now create “speaker models” and configure custom prompts, each combining a sound recording with its text.
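
Because Tune by Example is a beta feature, treat this as a hedged sketch based on my reading of the API reference rather than the project's code. Enrolling a speaker model is a `POST /v1/speakers?speaker_name=…` with the WAV recording as the body. A custom prompt is a multipart `POST /v1/customizations/{customization_id}/prompts/{prompt_id}` with a JSON `metadata` part (`prompt_text`, optionally `speaker_id`) and an audio `file` part. The service URL and all argument values are placeholders, and authentication is left out.

```python
import json
import uuid
import urllib.request
from urllib.parse import quote, urlencode

# Hypothetical placeholder; authentication is omitted in this sketch.
SERVICE_URL = "https://api.eu-de.text-to-speech.watson.cloud.ibm.com/instances/your-instance-id"


def speaker_model_request(speaker_name, wav_bytes):
    """POST /v1/speakers: enroll a speaker from a WAV recording."""
    url = f"{SERVICE_URL}/v1/speakers?{urlencode({'speaker_name': speaker_name})}"
    return urllib.request.Request(url, data=wav_bytes,
                                  headers={"Content-Type": "audio/wav"}, method="POST")


def prompt_request(customization_id, prompt_id, prompt_text, wav_bytes, speaker_id=None):
    """POST a custom prompt as multipart/form-data: a JSON 'metadata'
    part plus the recorded audio in a 'file' part."""
    metadata = {"prompt_text": prompt_text}
    if speaker_id:
        metadata["speaker_id"] = speaker_id
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(metadata)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{prompt_id}.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    body = head + wav_bytes + f"\r\n--{boundary}--\r\n".encode("utf-8")
    url = (f"{SERVICE_URL}/v1/customizations/{quote(customization_id)}"
           f"/prompts/{quote(prompt_id)}")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The speaker model response should contain the `speaker_id` that the prompt metadata refers to, so the two calls run in that order.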

The automation example

The links in the following text point directly to the related source code in the GitHub project watson-tts-invocation.

The bash automation script includes two flows; here is one of them:

  1. Creation of a custom model. These are the steps:
    1. Create a custom voice model
    2. List all customized models
    3. Create words
    4. Use the custom voice model
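
The four steps above can be sketched in Python as request builders, one per REST call. This is a hedged illustration, not a translation of the bash script. The service URL is a placeholder, authentication is left out, and `de-DE_BirgitV3Voice` is an assumption because I don't know which German voice the script uses.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

# Hypothetical placeholders; authentication is omitted in this sketch.
SERVICE_URL = "https://api.eu-de.text-to-speech.watson.cloud.ibm.com/instances/your-instance-id"
VOICE = "de-DE_BirgitV3Voice"  # assumption: the German voice to customize


def _json_request(method, path, payload=None, headers=None):
    """Build (but do not send) a request with an optional JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    all_headers = {"Content-Type": "application/json"} if data is not None else {}
    all_headers.update(headers or {})
    return urllib.request.Request(SERVICE_URL + path, data=data,
                                  headers=all_headers, method=method)


# 1. Create a custom voice model (POST /v1/customizations).
def create_model(name, description):
    return _json_request("POST", "/v1/customizations",
                         {"name": name, "language": "de-DE", "description": description})


# 2. List the custom models, filtered by language (GET /v1/customizations).
def list_models():
    return _json_request("GET", "/v1/customizations?" + urlencode({"language": "de-DE"}))


# 3. Add words, using the same {"words": [...]} structure as pfaelzer-words.json.
def add_words(customization_id, words):
    return _json_request("POST", f"/v1/customizations/{quote(customization_id)}/words",
                         {"words": words})


# 4. Use the custom model when synthesizing (POST /v1/synthesize), asking for WAV.
def synthesize(text, customization_id):
    query = urlencode({"voice": VOICE, "customization_id": customization_id})
    return _json_request("POST", "/v1/synthesize?" + query, {"text": text},
                         headers={"Accept": "audio/wav"})
```

To actually send a request, add the `Authorization` header (basic auth with the user name `apikey`) and call `urllib.request.urlopen(req)`. The create call answers with JSON that contains the new model’s `customization_id`, which steps 3 and 4 need, and the synthesize response body is the WAV audio.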

The following code is an extract from the words configuration in the pfaelzer-words.json file of the GitHub project. Each entry maps a German word to the “sounds-like” spelling the custom model should use.

{"words":[ 
    {"word":"ich", 
     "translation":"isch"
    },
    {"word":"Ich", 
     "translation":"isch"
    },
    {"word":"rede", 
     "translation":"babbel"
    },
    {"word":"ein", 
     "translation":"ähn"
    },
    {"word":"komme", 
     "translation":"kumm"
    },
    {"word":"der", 
     "translation":"dä"
    },
    {"word":"Pfalz",
     "translation":"Pallz"
    },
    {"word":"Pfälzer",
     "translation":"Pällzer"
    }
  ]
}
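
To see which words of a sentence the custom model will affect before calling the service, a small local preview helps. This is a hypothetical helper, not part of the GitHub project. It applies the sounds-like translations from the extract above with exact, case-sensitive whole-word matching, which only approximates how the service matches custom words.

```python
import json
import re

# The same entries as the pfaelzer-words.json extract.
WORDS_JSON = """
{"words": [
    {"word": "ich", "translation": "isch"},
    {"word": "Ich", "translation": "isch"},
    {"word": "rede", "translation": "babbel"},
    {"word": "ein", "translation": "ähn"},
    {"word": "komme", "translation": "kumm"},
    {"word": "der", "translation": "dä"},
    {"word": "Pfalz", "translation": "Pallz"},
    {"word": "Pfälzer", "translation": "Pällzer"}
]}
"""


def load_mapping(raw):
    """Turn the {"words": [...]} structure into a word -> translation dict."""
    return {entry["word"]: entry["translation"] for entry in json.loads(raw)["words"]}


def preview(text, mapping):
    """Replace every whole word (Unicode letters included) that has a
    translation; all other words and punctuation stay unchanged."""
    return re.sub(r"\w+", lambda m: mapping.get(m.group(0), m.group(0)), text)


mapping = load_mapping(WORDS_JSON)
print(preview("Ich komme aus der Pfalz.", mapping))  # isch kumm aus dä Pallz.
```

The preview only rewrites text locally; with the custom model, the service keeps the original text and changes only how the words are pronounced.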

Summary

The Watson TTS service is easy to configure with cURL and needs no additional tooling. The API documentation is great and describes each of the API calls I used in the example.

By the way, the getting-started guide in the IBM Cloud documentation is very good.


I hope this was useful for you. Let’s see what’s next!

Greetings,

Thomas

#watsontexttospeech, #bashscript, #tts, #ai, #ibmcloud
