Customizing speech synthesis
Tailor your assistant's speech to echo your brand's unique tone and personality. With a suite of customization tools, you can fine-tune pitch, speed, accents, and even add a sprinkle of emotion.
Let's get started

Craft compelling texts for your digital voice assistant. Insert your message into the Speech text window in the Message node.
Speech synthesis markup language (SSML)
Get ready to fine-tune your digital assistant's voice with some SSML magic! 🪄 Let's play with pitches, pauses, and personalities.
SSML stands for Speech Synthesis Markup Language. It is an XML-based markup language used to control text-to-speech output. SSML provides a way to specify the pronunciation, intonation, and other aspects of the synthesized speech, giving developers fine-grained control over how the text is spoken. It is used in applications such as voice assistants, text-to-speech software, and automated voice response systems to provide a more natural and user-friendly experience.
Some examples of what can be controlled with SSML include:
Voice selection: You can choose from a variety of pre-defined voices, including different genders, accents, and languages, or even specify a custom voice using SSML.
Pronunciation: You can specify the pronunciation of a word or phrase, including how individual phonemes should be pronounced.
Prosody: You can control the pitch, volume, and rate of the speech, as well as insert pauses, emphasis, and other effects to give the synthesized speech a more natural-sounding intonation and rhythm.
SSML tags
In the context of SSML, tags are elements of XML markup language that are used to provide instructions for controlling the synthesis of text-to-speech output. These tags are enclosed within angle brackets (< and >), and are used to define specific elements or attributes that are recognized by SSML processors.
Generally, there are two types of tags:
Taking a closer look
Delve into the Essential SSML Tags: Enhancing Your Assistant's Voice with Tag Mastery
<break> allows you to insert a pause in the speech. The time attribute specifies the duration of the pause, and the strength attribute specifies the strength of the pause.
<break time="10ms"/>
Pause of 10 milliseconds
<break strength="strong"/>
Pause of the set strength. Possible attribute values: x-weak, weak, medium, strong, x-strong
<break time="500ms" strength="weak"/>
Pause of 500 milliseconds with the set strength
<break time="50%"/>
Pause lasting 50% of the default
<break time="1s"/>
Pause of 1 second
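If the speech text is assembled by a script, the break tags above can be generated instead of typed by hand, which avoids misspelled attribute names. The following is a minimal sketch in Python; `break_tag` is a hypothetical helper, not part of any SDK, and it only covers the millisecond and strength variants shown above.

```python
from typing import Optional

# Strength values listed in the text above.
ALLOWED_STRENGTHS = {"x-weak", "weak", "medium", "strong", "x-strong"}


def break_tag(time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Build a self-closing <break/> tag (hypothetical helper for illustration)."""
    if strength is not None and strength not in ALLOWED_STRENGTHS:
        raise ValueError(f"unsupported strength: {strength!r}")
    if time_ms is not None and time_ms < 0:
        raise ValueError("time_ms must not be negative")

    attrs = []
    if time_ms is not None:
        attrs.append(f'time="{time_ms}ms"')
    if strength is not None:
        attrs.append(f'strength="{strength}"')

    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print(break_tag(time_ms=500, strength="weak"))
```

The call at the end prints `<break time="500ms" strength="weak"/>`. A misspelled strength value raises a `ValueError` before the text reaches the synthesizer.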
Tags usage
SSML code is written in XML format and is typically embedded within the text of the document that is being processed by a text-to-speech system. Here's an example of what SSML code might look like:
Hey there! <prosody rate="slow">I'm your friendly virtual assistant.</prosody> <break time="500ms"/> <prosody volume="loud">How can I help you today?</prosody>
To use SSML-enriched text as output in your digital assistant, paste it into the Speech window of the Message node (MSG_NODE) in the Flow editor.
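A missing quote or an unclosed tag is easy to overlook in hand-written SSML. Before pasting a text into the Speech window, it can be checked for XML well-formedness with the Python standard library. This is a minimal sketch: wrapping the fragment in a `<speak>` root is an assumption made only so the parser accepts it, and the check says nothing about whether the platform supports a given tag or attribute.

```python
import xml.etree.ElementTree as ET


def is_well_formed_ssml(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a <speak> root."""
    try:
        ET.fromstring(f"<speak>{fragment}</speak>")
        return True
    except ET.ParseError:
        return False


good = 'Hey there! <prosody rate="slow">I\'m your assistant.</prosody> <break time="500ms"/>'
bad = 'Hey there! <break time="500ms/> How can I help?'  # closing quote is missing

print(is_well_formed_ssml(good))
print(is_well_formed_ssml(bad))
```

The first call reports a well-formed fragment, the second one catches the unterminated attribute value.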

Tips and tricks
Copywriting for your digital assistant should be simple and straightforward so that users can understand it easily. Use clear questions that help the customer formulate their answers. Where possible, use the natural words, phrasing, and vocabulary that people use in everyday life. This makes it easier to communicate with the voicebot.
Key notes:
With a voicebot, you do not have to follow spelling and grammar as strictly as in other types of communication. Spelling and grammar are still important and have a strong influence on synthesis quality.
However, in some cases, grammatically, orthographically, and typographically correct text can produce a result that is not pleasant to listen to, negatively affecting intonation and spoiling the quality of the synthesis.
Therefore, work consciously with this characteristic of the synthetic voice and make "mistakes" on purpose. For example, deleting or adding commas in a sentence can significantly improve the quality of the synthesis.
Deliberately using the wrong spelling can also have its advantages and, in some cases, improves the pronunciation of certain words or phrases.
Copywriting structure
Sentence structure
The two basic components of a sentence are the topic (Czech: téma) and the comment, or rheme (Czech: réma). The topic is the word or group of words that determines what the sentence is about, while the rheme is the part of the sentence that follows and completes the topic. For example, in the sentence "My dog's name is Max", the topic is "My dog's name" and the rheme is "is Max". Topic and rheme are important for proper sentence construction and are essential for understanding meaning.
In English, German, and many other languages, word order is largely fixed, but that does not mean we cannot use this principle to our advantage as well. In English, this principle is reversed: the more important information is put first.
Multiple choice question
CZ:
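Voláte kvůli své objednávce, případně dříve zakoupeného výrobku? Stačí mi jednoduchá odpověď ano nebo ne. (EN: Are you calling about your order, or about a previously purchased product? A simple yes or no answer is enough.)
Zboží můžete vrátit na prodejně, kurýrem nebo přes zásilkovnu. Kterou z těchto možností zvolíte? (EN: You can return the goods at a store, by courier, or via a parcel pickup point. Which of these options do you choose?)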
EN:
Are you calling to unblock your account? Please answer with yes or no.
Multiple questions in a single message
If you ask the question at the end of the speech, the customer has already heard all the relevant information and is ready to respond.
In general, people are more likely to retain the freshest information in their memory, i.e. the information they heard last (see rheme).
This will increase the likelihood that the customer will respond concisely and appropriately, and the voicebot will recognize everything correctly and provide the most complete answer to the customer's query.
The question asked will indicate to the customer that it is time for them to start talking.
Supporting argument for positioning question at the end of the speech:
CZ:
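Chtěl bych vám nabídnout konzultaci s naším expertem. Ozve se Vám a zdarma Vám provede kalkulaci toho nejvýhodnějšího pojištění. Máte zájem? (EN: I would like to offer you a consultation with our expert. They will contact you and calculate the most advantageous insurance for you free of charge. Are you interested?)
Rádi bychom Vám poskytli konzultaci zdarma. Ozve se Vám náš expert s kalkulací, které pojištění by pro Vás bylo nejvhodnější. Souhlasíte? (EN: We would like to give you a free consultation. Our expert will contact you with a calculation of which insurance would suit you best. Do you agree?)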
It's better if the voicebot's text contains the question at the end of the speech. This helps to prevent the customer from jumping into the voicebot's speech. The solution is not designed to let the customer interrupt the virtual assistant and then have the voicebot go back and finish the rest of its speech.
Rather, the solution is built on alternating between periods when the voicebot is speaking and not listening (running text-to-speech synthesis) and periods when the voicebot is silent, listening, and evaluating the transcript of the response (running speech-to-text transcription). If the customer starts speaking before the voicebot has finished, speech-to-text transcription is not running and part of the response is lost, which can lead to misrecognition of the intent.
Position of the question in digital assistant's message
CZ:
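Zkusíme to znovu? (EN: Shall we try again?)
Vyplnil jste všechny údaje správně? (EN: Did you fill in all the details correctly?)
Přejete si reklamaci řešit raději písemnou cestou? (EN: Would you prefer to handle the complaint in writing?)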
A question should never be phrased negatively. People are generally more willing to accept positive information and ideas than negative ones. Phrasing the question negatively can also reduce the likelihood that the voicebot will correctly understand the customer's answer.
CZ:
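Nechcete to zkusit znovu? (EN: Don't you want to try again?)
Neudělali jste chybu? (EN: Didn't you make a mistake?)
Jste si jistý, že jste neudělal chybu? (EN: Are you sure you didn't make a mistake?)
Nepřejete si reklamaci raději řešit písemnou cestou? (EN: Wouldn't you prefer to handle the complaint in writing?)
SK:
Nechcete to skúsiť znova? (EN: Don't you want to try again?)
Neurobili ste chybu? (EN: Didn't you make a mistake?)
Nechceli by ste svoju sťažnosť riešiť radšej písomne? (EN: Wouldn't you rather handle your complaint in writing?)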
EN:
Won't you try again?
Don't you want to handle your complaint via e-mail?
What would the answer "yes" (ano/áno) mean in this case? "Yes, I want to" or "Yes, I indeed don't want to"? This cannot be discerned from the context with 100% confidence, so we recommend phrasing questions positively or neutrally.
Question wording is a key element of the voicebot scenario. The wording of the question influences the phrasing of the customer's answer, and therefore the intent that must be trained for the virtual assistant to recognize the answer.
Question formulation
Depending on the length of the sentence, the quality of the synthesis can fluctuate and sound artificial. In such cases, it helps to shorten the copy, or to add or edit some words so that the voice synthesis sounds better. This approach can produce a more natural-sounding synthesis without complicated settings or SSML tags.
Intonation best practice
Master the art of customizing speech synthesis intonation with SSML on our dedicated documentation page. Learn to personalize your AI voice with precision and creativity for a truly tailored and engaging auditory experience!
Intonation curve
Text-to-speech synthesis is also able to automatically detect the sentence type and set the corresponding intonation curve. This means that you can easily create a synthesized voice that sounds natural and matches the text you enter accurately.
Pro-tip! ✨ For neural voices provided by Microsoft Azure, feel free to use the Audio Content Creator tool in Azure Speech Studio. Fine-tune synthesized speech audio to fit your scenario. Define lexicons and control speech parameters such as pronunciation, pitch, rate, pauses, and intonation.

In Azure Audio Content Creator, you can adjust intonation in sections, which means you can set intonation for each sub-section, phrase, or word separately, instead of having to set intonation for the whole sentence. This allows you to capture different intonation nuances more accurately and gives you more control over how the neural voice will appear.

Accentuating intonation allows listeners to distinguish between the different ideas and information contained in a sentence. By adjusting intonation curves, you can highlight important points or emphasize changes in mood or emotion, which will help the listener better understand and remember what the voice is saying.
Melody patterns
Here are some examples of inflexion patterns you can achieve:
To create a rising intonation pattern, you would place anchor points at the beginning of a phrase or sentence and gradually raise the pitch as you move forward in time. This pattern is commonly associated with questions or uncertainty.
Want more detailed tips ⁉️
Pauses and breaks best practice
Basics
The <break> tag inserts a pause into the speech output, for example between words or sentences, which can make the delivery sound more natural. It lets you specify the duration and the strength of the pause and can be used with the following attributes:
time: Specifies the duration of the pause in seconds or milliseconds. For example, <break time="500ms"/> inserts a pause of 500 milliseconds.

strength: Specifies the strength or intensity of the pause. It can have values like "none", "x-weak", "weak", "medium", "strong", or "x-strong". The actual interpretation of these values may depend on the specific text-to-speech engine used.

Example
Hello, <break time="500ms"/> how are you today?
Hello, <break strength="medium"/> how are you today?
Pronunciation best practice
Phoneme-defined pronunciation
Stress can have a huge impact on pronunciation.
When a word is stressed, the stressed syllable is typically pronounced with greater intensity, higher pitch, and longer duration compared to unstressed syllables. Additionally, the quality of vowels in stressed syllables may also be affected, with stressed vowels often being pronounced with more clarity and fullness.
In 🇨🇿 Czech, stress is fixed on the first syllable of a word. The first syllable receives the primary stress, while the following syllables are unstressed or carry at most a weak secondary stress. Note that a monosyllabic preposition usually forms one stress unit with the following word and takes the stress itself (for example, "na stole").
To provide some examples in IPA (International Phonetic Alphabet), let's consider a few Czech words:
"kniha" (book):
/ˈkɲɪɦa/
The primary stress falls on the first syllable (/kɲɪ/), making it more prominent in pronunciation.
"univerzita" (university):
/ˈunɪvɛrzɪta/
The primary stress is on the first syllable (/u/). The remaining syllables (/nɪ/, /vɛr/, /zɪ/, and /ta/) are unstressed.
"překvapení" (surprise):
/ˈpr̝̊ɛkvapɛɲiː/
The primary stress is on the first syllable (/pr̝̊ɛ/). The final syllable (/ɲiː/) is long but unstressed, because vowel length and stress are independent in Czech.
These examples illustrate the general stress pattern in Czech. It's important to consult native speakers or audio resources to further refine your pronunciation and understand the intricacies of Czech stress.
Foreign words pronunciation
When using Azure's Speech Studio with neural voices to handle foreign word pronunciation, here are some tips to ensure accurate pronunciation:
Phonetic spelling: Provide a phonetic spelling of the foreign words using the International Phonetic Alphabet (IPA) or a transcription system familiar to the base language neural voices. This helps the TTS system understand the correct pronunciation of the word.
Lexicon customization: Utilize the lexicon customization feature in Azure's Speech Studio to add pronunciation rules for specific foreign words. This allows you to specify the pronunciation of each word or phrase more precisely.
Pronunciation rules: Create pronunciation rules for common patterns found in foreign words. For example, if there is a consistent pattern of stress in the foreign language, you can define rules to apply stress in the appropriate position.
Contextual cues: Provide additional context within the text to help guide the TTS system's pronunciation. This could include nearby words or phrases that assist in determining the correct pronunciation of the foreign word.
Test and iterate: After applying the above techniques, listen to the generated speech and identify any mispronunciations. Adjust the phonetic spellings, lexicon entries, or pronunciation rules as necessary and continue testing until the desired pronunciation is achieved.
It's important to note that while these tips can improve the accuracy of foreign word pronunciation in TTS systems, the results may still vary. TTS systems are trained on large datasets and generalize pronunciation based on the language's phonetic patterns. Handling foreign words can be challenging due to the diverse pronunciation rules across languages.
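The phonetic spelling tip can be expressed with the standard SSML `<phoneme>` element, which carries an IPA transcription in its `ph` attribute. The sketch below builds such an element in Python. `phoneme_tag` is a hypothetical helper, and whether a particular voice honors IPA input for a given language is an assumption that has to be verified by listening to the result.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(text: str, ipa: str) -> str:
    """Wrap text in an SSML <phoneme> element with an IPA pronunciation.

    Hypothetical helper: it only builds the markup and does not check
    that the IPA string is valid for the selected voice.
    """
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


print(phoneme_tag("IP", "a͡j.piː"))
```

The text and the transcription are escaped, so characters such as `&` in a brand name do not break the surrounding markup.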
🇨🇿 Examples:
Some phrases or individual words from foreign vocabulary are covered by the default text-to-speech model, and the synthesis is smooth, localized to the base language, and pleasant to the ear. These sound fine without adjustments:
Nejpoužívanější vyhledávač v Česku je Seznam, nikoliv Google.
Letíme na dovolenou se společností Lufthansa.
Koupím ojetý renault.
Other phrases or words can be very similar and comprehensible, needing only a few tweaks here and there. Even though their pronunciation is nearly correct, this can cause an uncanny valley effect:
Ceny mají jako Deutsche bahn, ale služby jako nejposlednější drožka.
"Deutsche" is pronounced correctly as [dɔ͡ɪˈt.ʃɛ.], but "bahn" sounds like [ba.ɦaːˈ.ẽˈ] and would need to be adjusted.
Potřebuji znát vaši IP adresu.
"IP" is pronounced as [iːˈ.pɛː], which is comprehensible, but in this case it would be better to adjust it to the English-style pronunciation [a͡j.piː].
🇨🇿 The pronunciation of numbers followed by currency signs differs considerably from the common reading rules. With prices, sums of money, or values, we often omit the words describing the decimal order of the fractional part (desetiny, setiny; that is, tenths, hundredths).
Integers
🇨🇿 Czech language
If a number has 6 or fewer digits, it is always read as a whole number (decadically) by default, regardless of whether a space separates the thousands.
1234 and 1 234 are both pronounced TISÍC DVĚ STĚ TŘICET ČTYŘI
12345 and 12 345 are both pronounced DVANÁCT TISÍC TŘI STA ČTYŘICET PĚT
123456 and 123 456 are both pronounced STO DVACET TŘI TISÍC ČTYŘI STA PADESÁT ŠEST
If a number has 7 or more digits, it is read as a whole number only if the thousands are separated by spaces. A numeric string without spaces defaults to being read digit by digit.
1234567 is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM
1 234 567 is pronounced MILION DVĚ STĚ TŘICET ČTYŘI TISÍC PĚT SET ŠEDESÁT SEDM
123456789 is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM OSM DEVĚT
123 456 789 is pronounced STO DVACET TŘI MILIONŮ ČTYŘI STA PADESÁT ŠEST TISÍC SEDM SET OSMDESÁT DEVĚT
If you need a shorter number to be read digit by digit, there are several ways to do it:
A. Add spaces in between: 1 2 3 4 5 6 is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST
B. Add commas in between: 1, 2, 3, 4, 5, 6 is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST with more distinct pauses between the digits
C. Use the SSML say-as tag for spelling (hláskování): <say-as interpret-as="spell" format="undefined">123456</say-as> is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST
If you need a longer number to be read digit by digit, adapt the copywriting input accordingly:
A. Use a numeric string without spaces: 123456789
B. Separate each digit with spaces: 1 2 3 4 5 6 7 8 9
C. Separate each digit with punctuation: 1,2,3,4,5,6,7,8,9
These numbers will be pronounced as JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM OSM DEVĚT
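When numbers come from data rather than from hand-written copy, the three digit-by-digit variants described above can be produced by small helpers. This is a minimal sketch in Python with hypothetical function names. The optional `format` attribute shown in the text is left out of the generated tag, and how the voice actually reads each variant still needs to be checked by ear.

```python
ASCII_DIGITS = "0123456789"


def only_digits(number: str) -> str:
    """Keep only the ASCII digits of the input, dropping spaces and punctuation."""
    return "".join(ch for ch in number if ch in ASCII_DIGITS)


def digits_with_spaces(number: str) -> str:
    """Variant A: separate each digit with a space."""
    return " ".join(only_digits(number))


def digits_with_commas(number: str) -> str:
    """Variant B: separate each digit with a comma for more distinct pauses."""
    return ", ".join(only_digits(number))


def digits_spelled(number: str) -> str:
    """Variant C: wrap the digits in an SSML say-as tag with interpret-as="spell"."""
    return f'<say-as interpret-as="spell">{only_digits(number)}</say-as>'


print(digits_with_spaces("123456"))
print(digits_with_commas("123456"))
print(digits_spelled("123 456"))
```

The three calls print `1 2 3 4 5 6`, `1, 2, 3, 4, 5, 6`, and `<say-as interpret-as="spell">123456</say-as>`.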
Phone numbers
🇨🇿 Czech language
For the pronunciation of telephone numbers:
A. Write them in international format, including the prefix:
Volejte +420 800 148 148 is pronounced VOLEJTE PLUS ČTYŘI STA DVACET, OSM SET, STO ČTYŘICET OSM, STO ČTYŘICET OSM
B. Separate custom sections with punctuation:
Volejte 800, 148, 148 is pronounced OSM SET, STO ČTYŘICET OSM, STO ČTYŘICET OSM
Volejte 212-456-789 is pronounced DVĚ STĚ DVANÁCT, ČTYŘI STA PADESÁT ŠEST, SEDM SET OSMDESÁT DEVĚT
Volejte 800, 54, 12, 12 is pronounced OSM SET, PADESÁT ČTYŘI, DVANÁCT, DVANÁCT
Volejte 800, 12, 7, 7, 7, 7 is pronounced OSM SET, DVANÁCT, SEDM, SEDM, SEDM, SEDM
C. Write them out in words:
Volejte osm set dvanáct čtyři sedmničky
Volejte osm set dvanáct sedm sedm sedm sedm
D. Separate sections with spaces, as long as the number is meant to be pronounced in a 3-2-2-2 or 3-2-1-1-1-1 pattern:
800 54 12 12 is pronounced OSM SET, PADESÁT ČTYŘI, DVANÁCT, DVANÁCT
800 54 1 2 1 2 is pronounced OSM SET, PADESÁT ČTYŘI, JEDNA, DVA, JEDNA, DVA
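Option B, separating custom sections with punctuation, is easy to automate when phone numbers are stored as plain digit strings. The sketch below groups a number according to a chosen pattern. `group_phone_number` and its default 3-2-2-2 pattern are illustrative assumptions, and groups that start with a zero may be read differently than expected, so test them with the target voice.

```python
from typing import Sequence


def group_phone_number(number: str, pattern: Sequence[int] = (3, 2, 2, 2)) -> str:
    """Split a phone number into comma-separated groups of the given sizes."""
    digits = "".join(ch for ch in number if ch in "0123456789")
    if len(digits) != sum(pattern):
        raise ValueError(
            f"expected {sum(pattern)} digits for pattern {tuple(pattern)}, got {len(digits)}"
        )

    groups = []
    position = 0
    for size in pattern:
        groups.append(digits[position:position + size])
        position += size
    return ", ".join(groups)


print("Volejte " + group_phone_number("800541212"))
print("Volejte " + group_phone_number("800 148 148", pattern=(3, 3, 3)))
```

The two calls print `Volejte 800, 54, 12, 12` and `Volejte 800, 148, 148`, matching the punctuation-separated forms listed above.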
Long numerical strings (IDs, codes)
🇨🇿 Czech language
For the pronunciation of long numerical strings (order IDs etc.):
A. Write numerical strings without spaces to have them read digit by digit:
Objednávka 01304578931 is pronounced OBJEDNÁVKA NULA JEDNA TŘI NULA ČTYŘI PĚT SEDM OSM DEVĚT TŘI JEDNA
B. Customize the pronunciation by breaking the string into smaller chunks with punctuation:
Objednávka 01, 30, 45, 78, 9-3-1 is pronounced OBJEDNÁVKA NULA JEDNA, TŘICET, ČTYŘICET PĚT, SEDMDESÁT OSM, DEVĚT TŘI JEDNA
Numeric date notation
🇨🇿 Czech language
Dates are pronounced correctly by default when written in one of the following numeric formats:
A. DD-MM-YYYY: 01-11-1995, 1-11-1995
Pronounced as [1. listopadu 1995]
B. ISO YYYY-MM-DD: 1995-11-01, 1995-11-1
Pronounced as [1. listopadu 1995]
C. DD.MM.YYYY: datum 01.11.1995, datum 01. 11. 1995, datum 1.11.1995, datum 1. 11. 1995
Pronounced as [1. listopadu 1995]
A safe and simple way is to write the date out with the month in words:
datum 1. listopadu 1995, prvního listopadu 1995
Pronounced as [1. listopadu 1995]
prvního jedenáctý 1995
Pronounced as [prvního jedenáctý 1995]
These formats WON'T WORK and will be pronounced incorrectly, even if tagged with SSML alias reading rules for dates:
DD.MM. - 1.11. - [jedna hodina jedenáct minut]
DD. MM. - 1. 11. - [první jedenáctý]
DD/MM - 01/11 - [nula jedna lomítko jedenáct]
DD/MM/YYYY - 01/11/1995, 1/11/1995 - [nula jedna lomítko jedenáct lomítko 1995]
SSML alias reading rules are not supported for these formats.
To be pronounced correctly, dates containing only the day and month must be written out manually:
prvního listopadu, dne 2. listopadu, 31. března, 17. listopad
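Writing the month out in words can be done programmatically, which is useful for day-and-month dates that have no reliable numeric format. The following Python sketch converts a `datetime.date` into the written form used above. `czech_date` is a hypothetical helper, and the month names are in the genitive case, as in "1. listopadu".

```python
from datetime import date

# Czech month names in the genitive case, as used in written dates.
MONTHS_GENITIVE = (
    "ledna", "února", "března", "dubna", "května", "června",
    "července", "srpna", "září", "října", "listopadu", "prosince",
)


def czech_date(value: date, with_year: bool = True) -> str:
    """Format a date as day, month name, and optionally the year."""
    text = f"{value.day}. {MONTHS_GENITIVE[value.month - 1]}"
    if with_year:
        text = f"{text} {value.year}"
    return text


print(czech_date(date(1995, 11, 1)))
print(czech_date(date(1995, 11, 1), with_year=False))
```

The calls print `1. listopadu 1995` and `1. listopadu`.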
Time notation
🇨🇿 Czech language
The standard ISO format is supported and is pronounced correctly as time by default:
HH:mm - 15:52 - [Patnáct hodin padesát dvě minuty]
HH:mm:ss - 15:52:25 - [Patnáct hodin padesát dva minut dvacet pět sekund]
HH.mm - 15.52 - [Patnáct hodin padesát dva minut]
Whole hours, even when written in digital form, are pronounced as analog time:
HH:00 - 15:00 - [Patnáct hodin]
HH:00:00 - 15:00:00 - [Patnáct hodin]
Czech unit abbreviations (the Czech SI unit notation) ARE NOT SUPPORTED:
hod and h are not pronounced as hodin/hodiny: 15 hod - [patnáct hod], 15 h - [patnáct há]
min and mins are not pronounced as minut/minuty: 15 min - [patnáct min], 15 hod 15 min - [patnáct hod patnáct min]
If you need the time to be pronounced as analog time, or Czech unit abbreviations to be pronounced correctly, transcribe your input:
15 hod → 15 hodin
15 min → 15 minut
2 min → 2 minuty
15:15 → čtvrt na čtyři
12:00 → poledne
12:30 → půl jedné odpoledne
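The unit transcriptions above follow the Czech plural rule: one form for 1, another for 2 to 4, and a third for 0 and for 5 or more. A small preprocessing step can apply that rule before the text is sent to synthesis. This is a minimal sketch in Python with hypothetical names. It produces nominative forms only, so sentences that need another grammatical case (for example after a preposition) still have to be written by hand.

```python
import re

# Forms for 1, for 2-4, and for 0 or 5 and more (nominative case only).
UNIT_FORMS = {
    "hod": ("hodina", "hodiny", "hodin"),
    "h": ("hodina", "hodiny", "hodin"),
    "min": ("minuta", "minuty", "minut"),
}

# A number followed by a standalone unit abbreviation, e.g. "15 hod" or "2 min".
UNIT_PATTERN = re.compile(r"\b(\d+)\s*(hod|min|h)\b")


def pick_form(count: int, forms: tuple) -> str:
    """Choose the Czech plural form that matches the count."""
    if count == 1:
        return forms[0]
    if 2 <= count <= 4:
        return forms[1]
    return forms[2]


def expand_time_units(text: str) -> str:
    """Replace unit abbreviations with full words the voice can read."""
    def replace(match: re.Match) -> str:
        count = int(match.group(1))
        return f"{match.group(1)} {pick_form(count, UNIT_FORMS[match.group(2)])}"

    return UNIT_PATTERN.sub(replace, text)


print(expand_time_units("15 hod 15 min"))
print(expand_time_units("2 min"))
```

The calls print `15 hodin 15 minut` and `2 minuty`. Text that already contains the full words, or a time such as 15:52, is left unchanged.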
Customizing speech synthesis of variables
A variable is a named storage location that holds a value in computer programming. It is a fundamental concept used to store and manipulate data within a program. In the context of voicebots and speech applications, variables can be used to store and retrieve information that is relevant to the conversation or user interaction.
General use
To use variables in the speech output of a voicebot, you need to incorporate the variable values within the text that the voicebot will read out loud (in the Message node, fill in the Speech window).
General approach:
Define and store the variable values: In your voicebot's code or script, define and store the necessary variable values based on user input or other relevant data. For example, you might have a variable named customer_name that stores the user's name.
Construct the speech output text: Craft the message, and include the variable values where appropriate.
Set Speech in Message node: Paste the constructed speech input for the text-to-speech engine to convert it into audible speech. Your digital assistant will then speak the generated text, incorporating the variable values dynamically.
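The general approach can be sketched outside the platform as plain string templating. The example below uses Python's `string.Template` with `$name` placeholders. That placeholder syntax is an assumption made for illustration and is not the syntax of the Flow editor, and `render_speech` is a hypothetical helper. Variable values are XML-escaped so that user-provided text cannot break SSML markup in the message.

```python
from string import Template
from xml.sax.saxutils import escape


def render_speech(template: str, variables: dict) -> str:
    """Insert escaped variable values into a speech text template.

    Raises KeyError if the template uses a variable that was not provided.
    """
    safe_values = {name: escape(str(value)) for name, value in variables.items()}
    return Template(template).substitute(safe_values)


speech = render_speech(
    'Hello $customer_name, <break time="300ms"/> your order $order_id is ready.',
    {"customer_name": "Tom & Jerry", "order_id": 1234},
)
print(speech)
```

The printed text reads `Hello Tom &amp; Jerry, <break time="300ms"/> your order 1234 is ready.`, with the ampersand escaped and the break tag left intact.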

Common problems and FAQ