# Customizing speech synthesis

## Let's get started

<figure><img src="https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbV12F3FDAxj6R0wPr0Xx%2Fuploads%2FZFwFIE18CY6SdFbWnfSm%2FMSG_speech_input.gif?alt=media&#x26;token=fc5f7082-a865-46a4-854d-f5cfb758d219" alt=""><figcaption></figcaption></figure>

Craft compelling texts for your digital voice assistant. Insert your message into the *Speech* text window in the Message node.

<details>

<summary>Expand to learn more</summary>

Text in the *Speech* window will be used as input for speech synthesis. Customize your text with [SSML](#speech-synthesis-markup-language-ssml) for even better results.

For top-notch voice synthesis, we've got some copywriting secrets up our sleeve! Check out our [tips and tricks](#tips-and-trick) on crafting texts that convert beautifully into speech.<br>

:robot: The text-to-speech provider and the choice of neural voice depend on your project configuration. Typically, we rely on Microsoft Azure as our primary text-to-speech provider. However, we're not limited to just one option! We also work with Google and other text-to-speech services, offering the possibility of custom neural voices for that extra touch of uniqueness.

</details>

***

## Speech synthesis markup language (SSML)

**Get ready to fine-tune your digital assistant's voice with some SSML magic!**  :magic\_wand: **Let's play with pitches, pauses, and personalities.**&#x20;

**SSML** stands for Speech Synthesis Markup Language, and it is an XML-based markup language used to **control the synthesis of text-to-speech output**. SSML provides a way to specify the pronunciation, intonation, and other aspects of the synthesized speech, giving developers fine-grained control over how the text is spoken.

Some examples of what can be controlled with SSML include:

* **Voice selection**: You can choose from a variety of pre-defined voices, including different genders, accents, and languages, or even specify a custom voice using SSML.
* **Pronunciation**: You can specify the pronunciation of a word or phrase, including how individual phonemes should be pronounced.
* **Prosody**: You can control the pitch, volume, and rate of the speech, as well as insert pauses, emphasis, and other effects to give the synthesized speech a more natural-sounding intonation and rhythm.

SSML is used in a variety of applications, such as voice assistants, text-to-speech software, and automated voice response systems, to provide a more natural and user-friendly experience for users.
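
As a minimal sketch of how these three controls look in markup, the following stand-alone SSML document selects a voice, fixes the pronunciation of one word, and adjusts prosody. The voice name `en-US-JennyNeural` (an Azure neural voice), the IPA transcription, and the sample sentences are assumptions chosen for illustration. Which voices are available, and whether the `<speak>` and `<voice>` wrappers are needed or are added for you, depends on your project configuration and provider.

```ssml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Voice selection: the voice name is an assumed example, not a project default -->
  <voice name="en-US-JennyNeural">
    <!-- Pronunciation: IPA transcription of a single word (assumed example) -->
    I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
    <!-- Prosody: slower rate, a pause, then louder volume -->
    <prosody rate="slow">Take your time.</prosody>
    <break time="500ms"/>
    <prosody volume="loud">How can I help you today?</prosody>
  </voice>
</speak>
```

The XML comments only annotate the sketch and can be removed.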

***

## SSML tags

In the context of SSML, tags are elements of XML markup language that are used to provide instructions for controlling the synthesis of text-to-speech output. These tags are **enclosed within angle brackets** (**<** and **>**), and are used to define specific elements or attributes that are recognized by SSML processors.

### Generally, there are two types of tags:

{% tabs %}
{% tab title="Pair tags" %}
Pair tags, also known as start tags and end tags, consist of **two tags that surround a block of content**. The first tag is the **start tag**, and it begins with the name of the element enclosed in angle brackets (**<** and **>**). The second tag is the **end tag**, which begins with a forward slash (**/**) followed by the name of the element enclosed in angle brackets.&#x20;

Pair tags define an element that has a beginning and an end, and the content between the tags is considered to be the value of the element.

**\<prosody>** - start tag\
**\</prosody>** - end tag

`<prosody> ... </prosody>`

{% code overflow="wrap" %}

```ssml
This sentence is not affected by SSML. <prosody> This sentence is modulated with prosody. </prosody> This sentence is no longer affected by SSML, because the second sentence was closed with the end tag.
```

{% endcode %}
{% endtab %}

{% tab title="Non-pair tags" %}
Non-pair tags, also known as empty tags or **self-closing tags**, consist of a single tag that does not surround any content.&#x20;

Non-pair tags are used to indicate that an element does not have any content but may have attributes. Non-pair tags end with a forward slash (/) before the closing angle bracket (>).

```
  ... <break time="2s"/> ...
```

{% code overflow="wrap" %}

```ssml
This is the part before the break <break time="2s"/> and this is the part after the break.
```

{% endcode %}
{% endtab %}
{% endtabs %}
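
Because SSML is XML, pair tags must be nested cleanly: a tag opened inside another tag has to close before the outer tag closes. The following sketch contrasts a well-formed fragment with an overlapping one. The sentences are assumed examples, and the XML comments are only annotations to remove before pasting the text into the *Speech* window.

```ssml
<!-- Well-formed: <emphasis> closes before <prosody> closes -->
<prosody rate="slow">Please wait <emphasis level="strong">one moment</emphasis>.</prosody>

<!-- Not well-formed: the tags overlap, so an XML parser rejects the text -->
<prosody rate="slow">Please wait <emphasis level="strong">one moment</prosody></emphasis>
```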

<details>

<summary>Additional information </summary>

Tags can also include **attributes**, which provide additional information about the element they modify. For example, the \<prosody> tag can include the pitch, rate, and volume attributes to control various aspects of the speech.<br>

* <mark style="color:blue;">`<prosody pitch="+10%" rate="90%">`</mark>` ``This sentence will be spoken with a higher pitch (+10%) and a slower rate (90% of the default).`` `<mark style="color:blue;">`</prosody>`</mark>
* `The following text will be spoken as individual digits:`` `<mark style="color:orange;">`<say-as interpret-as="digits">`</mark>`1234`<mark style="color:orange;">`</say-as>`</mark>
* `The following sentence will be spoken with a two-second pause in the middle: Hello,`` `<mark style="color:purple;">`<break time="2s"/>`</mark>`world!`

</details>

### Taking a closer look

The tabs below walk through the essential SSML tags and the attributes you can use to shape your assistant's voice.

{% tabs %}
{% tab title="<break>" %} <mark style="color:blue;">**\<break>**</mark> allows you to insert a pause in the speech. The <mark style="color:purple;">time</mark> attribute specifies the duration of the pause, and the <mark style="color:purple;">strength</mark> attribute specifies the strength of the pause.

| Tag | Effect |
| --- | --- |
| <mark style="color:blue;">\<break</mark> <mark style="color:purple;">time="10ms"</mark><mark style="color:blue;">/></mark> | Pause of 10 milliseconds |
| <mark style="color:blue;">\<break</mark> <mark style="color:purple;">strength="strong"</mark><mark style="color:blue;">/></mark> | <p>Pause of the set strength<br>Possible attribute values: x-weak, weak, medium, strong, x-strong</p> |
| <mark style="color:blue;">\<break</mark> <mark style="color:purple;">time="500ms"</mark> <mark style="color:purple;">strength="weak"</mark><mark style="color:blue;">/></mark> | Pause of 500 milliseconds with the set strength |
| <mark style="color:blue;">\<break</mark> <mark style="color:purple;">time="50%"</mark><mark style="color:blue;">/></mark> | Pause lasting 50% of the default. The SSML specification defines `time` in seconds or milliseconds, so check that your provider accepts percentages |
| <mark style="color:blue;">\<break</mark> <mark style="color:purple;">time="1s"</mark><mark style="color:blue;">/></mark> | Pause of 1 second |
{% endtab %}

{% tab title="<emphasis> " %} <mark style="color:blue;">**\<emphasis>**</mark> allows you to emphasize a particular word or phrase in the speech. The <mark style="color:purple;">level</mark> attribute specifies the level of emphasis, for example "strong" or "moderate".
{% endtab %}

{% tab title="<prosody> " %} <mark style="color:blue;">**\<prosody>**</mark> allows you to adjust the pitch, volume, and speaking rate of the speech. The <mark style="color:purple;">pitch, range, rate</mark>, and <mark style="color:purple;">volume</mark> attributes can be used to adjust these aspects of the speech.

| Tag | Effect |
| --- | --- |
| \<prosody **rate**="+5.00%"> ... \</prosody> | Changing rate, increasing by 5% |
| \<prosody **rate**="slow"> ... \</prosody> | <p>Changing rate to slow<br>Possible attribute values: x-slow, slow, medium, fast, x-fast</p> |
| \<prosody **pitch**="+5.00%">...\</prosody> | Changing voice pitch, increasing by 5% |
| \<prosody **pitch**="high">...\</prosody> | <p>Changing voice pitch to high<br>Possible attribute values: x-low, low, medium, high, x-high</p> |
| \<prosody **volume**="+5.00%">...\</prosody> | Changing volume, increasing by 5% |
| \<prosody **volume**="soft">...\</prosody> | <p>Changing volume to soft<br>Possible attribute values: x-soft, soft, medium, loud, x-loud</p> |
| \<emphasis **level**="strong">...\</emphasis> | <p>Setting emphasis to strong. In standard SSML, emphasis is a separate element rather than a prosody attribute<br>Possible attribute values: none, moderate, strong</p> |
| \<prosody **contour**="(0%,+10%) (50%,+50%) (100%,+10%)">...\</prosody> | Adjusting the pitch contour of the speech |
{% endtab %}

{% tab title="<say-as>" %} <mark style="color:orange;">**\<say-as>**</mark> allows you to specify how a particular string of text should be pronounced. The <mark style="color:red;">interpret-as</mark> attribute specifies the type of text to be interpreted, and the <mark style="color:red;">format</mark> attribute specifies the format of the text. The <mark style="color:red;">detail</mark> attribute can be used to provide additional information about how the text should be pronounced.

| Tag | Effect |
| --- | --- |
| \<say-as **interpret-as**="**xxx**">...\</say-as> | <p>The interpret-as attribute sets the interpretation rule for the entity<br>Possible attribute values: date, time, digits, character, spell, address, telephone, name, URL, etc.</p> |
| \<say-as interpret-as="xxx" **format="yyy"**>...\</say-as> | <p>Sets the format for the attribute; interpret-as can also be set as "Undefined"<br>Ex. <code>\<say-as interpret-as="date" format="md">9/1\</say-as></code> sets the format of the date entity to month-day<br>Pronounced as: <em>September 1st</em></p> |
{% endtab %}

{% tab title="<phoneme>" %} <mark style="color:green;">**\<phoneme>**</mark> allows you to specify the phonetic pronunciation of the enclosed text. The <mark style="color:green;">alphabet</mark> attribute specifies the phonetic alphabet being used, and the <mark style="color:green;">ph</mark> attribute specifies the phonemes to be pronounced.

| Tag | Effect |
| --- | --- |
| \<phoneme **alphabet="ipa"** **ph**="bɔˈɹn.dɪˈ.dʒɪ.təl.">Born Digital\</phoneme> | <p>Attribute <strong>alphabet</strong> sets IPA (International Phonetic Alphabet)<br>Attribute <strong>ph</strong> sets the phonemes to be pronounced<br>Phonemes are transcribed with IPA: Born Digital = \[bɔˈɹn.dɪˈ.dʒɪ.təl.]</p> |
{% endtab %}

{% tab title="<sub> " %} <mark style="color:blue;">**\<sub>**</mark> allows you to substitute one word or phrase for another. The <mark style="color:blue;">alias</mark> attribute specifies the replacement text, and the content between the start and end sub-tag represents the text to be replaced.

| Tag | Effect |
| --- | --- |
| \<sub alias="Born Digital">BD\</sub> | <p>Substitutes the text with the set alias<br>Ex. <em>BD</em> is read as its alias <em>Born Digital</em></p> |
{% endtab %}
{% endtabs %}
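
The tags above can be combined in a single message. The following is a minimal sketch, assuming the text is pasted into the *Speech* window as a fragment in the same way as the other examples on this page. The sentence, the order number, and the attribute values are illustrative assumptions, and support for individual tags and values varies by provider and voice.

```ssml
Your order <sub alias="number">no.</sub> <say-as interpret-as="digits">4072</say-as> is ready. <break time="500ms"/> <prosody rate="slow">You can pick it up at <emphasis level="strong">any</emphasis> branch.</prosody>
```

Here `<sub>` expands the abbreviation, `<say-as>` reads the number digit by digit, `<break>` separates the two sentences, and `<emphasis>` is nested cleanly inside `<prosody>`.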

***

### Tags usage

SSML code is written in XML format and is typically embedded within the text of the document that is being processed by a text-to-speech system. Here's an example of what SSML code might look like:

> Hey there! \<prosody rate="slow">I'm your friendly virtual assistant.\</prosody> \<break time="500ms"/> \<prosody volume="loud">How can I help you today?\</prosody>

To use SSML-enriched text as output **in your digital assistant**, paste it into the *Speech* window of the Message node (MSG\_NODE) in the Flow editor.

{% tabs %}
{% tab title="Step by step" %}

<figure><img src="https://4261467870-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6iQTvxgRZRwPS1NgIGEb%2Fuploads%2FoaS3MOPaeJJuRTzK6CXQ%2Fimage.png?alt=media&#x26;token=7283bcec-63e5-4fc2-b142-d616fbd4e3e4" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Correct tags example" %}

<figure><img src="https://4261467870-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6iQTvxgRZRwPS1NgIGEb%2Fuploads%2F3CcehuK898OvNbbwTbvS%2Fimage.png?alt=media&#x26;token=b46d06b9-ab58-44b0-92b7-806be9a51a58" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

***

## Tips and trick

Copywriting for your digital assistant should be **simple and straightforward** to make it easy for users to understand. It is essential to use clear and understandable questions that help the customer formulate their answers. If possible, it is helpful to use the same **natural language and words, phrasing, and lexicon that are commonly used in everyday life**. This will make it easier to communicate with the voicebot.

### **Keynotes:**

<table data-card-size="large" data-view="cards"><thead><tr><th></th></tr></thead><tbody><tr><td>In the case of a voicebot, you <mark style="color:red;"><strong>do not have to</strong></mark> <strong>stick to spelling and grammar</strong> as in other types of communication. Of course, spelling and grammar are still important and have a very strong influence on synthesis quality.</td></tr><tr><td>However, in some cases, <strong>grammatically, orthographically and typographically correct text</strong> can lead to a <strong>result that is </strong><mark style="color:red;"><strong>not pleasant</strong></mark><strong> to listen to</strong>, negatively affecting intonation and spoiling the quality of the synthesis.</td></tr><tr><td>Therefore, work consciously with this characteristic of the synthetic voice and <strong>make "mistakes" on purpose.</strong> <br>E.g., deleting or adding commas in a sentence can significantly improve the quality of the synthesis.</td></tr><tr><td><strong>Deliberately using the wrong spelling</strong> can also have its advantages and, in some cases, <strong>improves the pronunciation</strong> of certain words or phrases.</td></tr></tbody></table>

## Copywriting structure

#### Sentence structure

The two basic components of a sentence are the **topic** (theme) **and the comment** (rheme). The topic is the word or group of words that determines what the sentence is about, while the rheme is the part of the sentence that follows and completes the topic. For example, in the sentence *"My dog's name is Max"*, the topic is *"My dog's name"* and the rheme is *"is Max"*. Topic and rheme are important for proper sentence construction and are essential for understanding meaning.

{% hint style="info" %}
The flexibility in word order is a distinctive feature of Slavic languages, particularly Czech and Slovak, due to their rich inflectional systems.
{% endhint %}

<details>

<summary>Word order matters in Czech and Slovak</summary>

Changing the word order in a sentence can affect what information we consider important:

```
CZ:
Petr přišel pozdě do školy. 
     # Important = exactly where (to school, not to date) 
Petr do školy přišel pozdě. 
     # Important = exactly when (late, not on time)
Do školy přišel pozdě Petr. 
     # Important = exactly who (Petr, not Pavel)
```

Changing the word order can therefore affect which information is emphasised.

{% code overflow="wrap" %}

```
CZ:
Klient si může zvolit, zda chce spořit na účtu nebo investovat do podílových fondů. 
Zda chce spořit na účtu nebo investovat do podílových fondů, si může klient zvolit. 

Prosím, pečlivě si pročtěte obchodní podmínky. 
Obchodní podmínky si prosím pročtěte pečlivě. 

Pravděpodobně jste se stali terčem podvodníka. Ihned musíme zablokovat platební kartu. 
Pravděpodobně jste se stali terčem podvodníka. Platební kartu musíme zablokovat ihned.


---------

SK:
Klient si môže zvoliť, či chce sporiť na účte alebo investovať do podielových fondov. 
Či chce sporiť na účte alebo investovať do podielových fondov, si klient môže sám zvoliť. 

Prosím, pozorne si prečítajte podmienky. 
Obchodné podmienky si prosím prečítajte pozorne. 
 
Pravdepodobne ste sa stali terčom podvodníka. Musíme okamžite zablokovať vašu kreditnú kartu. 
Pravdepodobne ste sa stali terčom podvodníka. Musíme vašu kreditnú kartu zablokovať okamžite. 
```

{% endcode %}

</details>

{% hint style="warning" %}
In English, German, and other languages, word order is usually fixed, but that does not mean that we cannot use this principle to our advantage as well. In English, the principle is reversed and the more important information comes first:
{% endhint %}

<details>

<summary>English examples compared with Czech and Slovak</summary>

{% code overflow="wrap" fullWidth="true" %}

```
I saw a lion at the zoo yesterday.          # Important = it was a lion
Yesterday, I saw a lion at the zoo.         # Important = it was yesterday


I can only play the piano.   # Of all instruments, I play only piano (not guitar, not flute)
I only can play the piano.   # Playing piano is my only skill (not dancing, not singing)
```

{% endcode %}

{% code overflow="wrap" fullWidth="true" %}

```
CZ:
"Novou kartu vám můžeme poslat poštou, případně kurýrem. Také si ji můžete vyzvednout na pobočce. Kterou variantu preferujete?"
SK:
"Novú kartu vám môžeme poslať poštou alebo kuriérom. Môžete si ju tiež vyzdvihnúť na pobočke. Ktorú možnosť uprednostňujete?"
EN:
"We can send you a new card by post or courier. You can also pick it up at a branch. Which of these options do you prefer?"
```

{% endcode %}

You can also use sentences that require a multiple-choice answer, such as *"Which of the options do you wish to choose: A or B?"* Alternatively, we recommend accompanying both options with a verb so that each option is its own clause and it is clear that these are choices, for example, *"Do you want A or do you need B?"*

Another strategy is to first inform the customer of the options and then ask them to choose.

{% code overflow="wrap" fullWidth="true" %}

```
CZ:
Vyberte si, zda se chcete přihlásit pomocí e-mailu, nebo pomocí Facebooku.
Řekněte mi, zda chcete zaslat fakturu na e-mail, nebo zda ji máme poslat poštou.
Potřebuji nejdřív vědět, zda již u nás máte účet, nebo ho teprve chcete založit. 

SK:
Vyberte si, či sa chcete prihlásiť cez e-mail alebo pomocou Facebooku.
Povedzte mi, či chcete faktúru poslať e-mailom alebo či ju máme poslať poštou.
Najprv potrebujem vedieť, či už u nás máte účet, alebo si ho práve chcete otvoriť.

EN:
Choose whether you want to sign in via email or Facebook.
Tell me whether you want us to send you the invoice by email or whether you prefer to receive it by post.
First, I need to know whether you already have an account with us or whether you want to open one.
```

{% endcode %}

One way to write a question that requires a choice is to use clear and specific terms that identify each option. For example, instead of asking "Do you want A or B?", give each option its own verb, as in Example 2 below.

In **spoken language**, however, the difference between a yes/no question and a choice between options is unclear in Czech and Slovak as well as in several other languages (**English, German, French**, etc.). Customers may not understand the question at the first attempt and answer yes/no instead of choosing from the options. It is therefore necessary to take this into account when designing the conversation and to prepare the scenario for such situations. The number of unwanted answers can be reduced by appropriate copywriting with more **distinctive intonation** (see [Intonation best practice](#intonation-best-practice)) and emphasis on the individual options.

In **Slovak**, the Czech comma rule for *or* described below the examples does not apply; the writing of commas is governed by different rules.

{% code overflow="wrap" fullWidth="true" %}

```
Example 1.
Yes/no question. We're asking if they're interested in a drink at all. The expected answer is yes/no.

Do you want (A or B)? 
Dáš si kávu nebo čaj?
Do you want (coffee or tea)?
```

{% endcode %}

{% code overflow="wrap" %}

```
Example 2.
We'll serve you either coffee or tea. Choose one of the two. We expect a response of "coffee, please" or "tea with honey, thanks".

Do you want A, or B? 
Dáš si kávu, nebo čaj?
Do you want (coffee) or do you want (tea)?
```

{% endcode %}

In written **Czech**, a comma before *nebo* (or) marks two mutually exclusive choices. In this case, the comma changes the meaning of the sentence.

</details>
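
The advice on giving each option its own verb and on stressing the individual options could be sketched in SSML as follows. The sentence, the emphasis level, and the pause length are assumptions to tune by ear, and not every voice supports `<emphasis>`.

```ssml
Would you like <emphasis level="moderate">coffee</emphasis>, <break time="300ms"/> or would you like <emphasis level="moderate">tea</emphasis>?
```

The short pause separates the two options audibly, which plays the same role in speech that the comma plays in written Czech.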

***

### Multiple choice question

{% code overflow="wrap" %}

```
CZ: 
Voláte kvůli své objednávce, případně dříve zakoupeného výrobku? Stačí mi jednoduchá odpověď ano nebo ne.
Zboží můžete vrátit na prodejně, kurýrem nebo přes zásilkovnu. Kterou z těchto možností zvolíte?

EN:
Are you calling to unblock your account? Please answer with yes or no.
```

{% endcode %}

{% hint style="info" %}
**TIP!** In the beginning, before customers get used to the new technology, it is a good idea to provide short instructions on how to interact with the voicebot to avoid confusion and the tendency to press buttons like with IVR.
{% endhint %}

<details>

<summary>Wrong practice examples:</summary>

* **We don't want to make the user talk over the digital assistant!** On the contrary, we want the user's answer to be as concise and clear as possible, and thus easily and reliably recognizable by the voicebot. It is therefore a good idea to make sure that each voicebot message contains only one clear and understandable question.
* Having two questions in one message also makes it **difficult to adjust the synthesis,** as the **question mark is naturally followed by a longer pause** before the start of the next sentence, and this pause cannot be corrected with the SSML `<break>` tag.\
  From the user's perspective, it looks as if the voicebot has already finished, so the **user starts to answer and talks over** the voicebot.
* It is important to **make sure that one voicebot message does not contain two questions at the same time**. When multiple questions are included in the output message, the user may not know which question to focus on or what the correct answer is.\
  This can **confuse** and frustrate the user, which reduces the effectiveness of the communication and makes the user experience less enjoyable.

</details>

***

### **Multiple questions in a single message**

Supporting arguments for positioning the question at the end of the message:

* If you ask the question at the end of the message, the customer has already heard all the relevant information and is ready to respond.
* In general, people are more likely to retain the freshest information in their memory, i.e. the information they heard last (see the rheme under Sentence structure).
* This increases the likelihood that the customer will respond concisely and appropriately, and that the voicebot will recognize everything correctly and provide the most complete answer to the customer's query.
* The question signals to the customer that it is time for them to start talking.

{% tabs %}
{% tab title="Good practice examples:" %}
{% code overflow="wrap" %}

```
CZ:
Chtěl bych vám nabídnout konzultaci s naším expertem. Ozve se Vám a zdarma Vám provede kalkulaci toho nejvýhodnějšího pojištění. Máte zájem? 

Rádi bychom Vám poskytli konzultaci zdarma. Ozve se Vám náš expert s kalkulací, které pojištění by pro Vás bylo nejvhodnější. Souhlasíte?
```

{% endcode %}
{% endtab %}

{% tab title="Bad practice examples:" %}
{% code overflow="wrap" %}

```
CZ:
Máte zájem o poskytnutí konzultace? Ozval by se Vám náš expert, který zdarma provede kalkulaci nejvýhodnějšího pojištění Vám na míru.
```

{% endcode %}
{% endtab %}
{% endtabs %}

It's better if the voicebot's message contains the question **at the end.** This helps to prevent the customer from talking over the voicebot. The solution is not designed to let the customer interrupt the virtual assistant and then have the voicebot go back and finish the rest of its speech.

Rather, the solution alternates between periods when the voicebot is speaking and not listening (running text-to-speech synthesis) and periods when the voicebot is silent, listening, and evaluating the transcript of the response (running speech-to-text transcription). **If the customer speaks before the voicebot has finished speaking, speech-to-text transcription is not running** and part of the response is lost, which can lead to misrecognition of the intent.
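
A minimal sketch of a message that keeps the information first and the single question last is shown below. The wording and the optional pause before the question are assumptions for illustration.

```ssml
We would like to offer you a free consultation. Our expert will call you with a quote for the most suitable insurance. <break time="300ms"/> Are you interested?
```

Because nothing follows the question, the end of the audio coincides with the moment the voicebot starts listening.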

***

### Position of the question in digital assistant's message

```
CZ:
Zkusíme to znovu? 
Vyplnil jste všechny údaje správně?
Přejete si reklamaci řešit raději písemnou cestou?
```

A question should never be phrased negatively. People are generally more willing to accept positive information and ideas than negative ones, and a negatively phrased question reduces the likelihood that the voicebot will correctly understand the customer's answer. The examples above are phrased positively; compare them with the negative phrasings below.

```
CZ:
Nechcete to zkusit znovu?      
Neudělali jste chybu? 
Jste si jistý, že jste neudělal chybu?
Nepřejete si reklamaci raději řešit písemnou cestou? 

SK:
Nechcete to skúsiť znova?   
Neurobili ste chybu?  
Nechceli by ste svoju sťažnosť riešiť radšej písomne?

EN:
Won't you try again?
Wouldn't you rather handle your complaint in writing?
```

What would the answer *yes* (*ano/áno*) mean in this case? *Yes, I want to* or *Yes, I indeed don't want to?* This situational context cannot be discerned with 100% confidence, so we recommend phrasing questions positively or neutrally.

**Question wording is a key** element of the voicebot scenario. The wording of the question influences the phrasing of the customer's answer, and therefore the intent that must be trained for the virtual assistant to recognize the answer.

#### Question formulation

Depending on the length of the sentence, the quality of the synthesis can fluctuate and sound artificial. In such cases, it is useful to shorten the text, or to add or edit some words, so that the voice synthesis sounds better. This approach can produce a more natural voice synthesis without the need for complicated settings or SSML tags.

<details>

<summary><mark style="color:green;"><strong>10 Copywriting tips for your digital assistant with speech synthesis</strong></mark></summary>

1. The copywriting must be snappy, clear, and understandable.
2. Speak the language of your customers. Use terms and slang they understand.
3. Never phrase questions negatively.
4. 1 message node = 1 question maximum.
5. The ideal placement of the question is at the end of the message.
6. Never ask two different things with one question.
7. For multiple-choice questions, make sure to clearly differentiate the options. If appropriate, rephrase the question as a declarative sentence that tells the customer they have to choose and lists the options.
8. Spelling and grammatical correctness in *Speech* is secondary. However, make errors consciously so that they benefit the quality of the synthesis.
9. Avoid foreign-language expressions whenever possible. Alternatively, write them the way they should be read aloud.
10. Write out in words any special characters and special cases that are to be read aloud.

</details>
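
Tips 9 and 10 can also be applied with the `<sub>` tag, which keeps the original spelling in the text while the alias is what gets read. The following is a sketch under assumptions: the Czech respelling `kešbek` and the sample sentences are illustrative, and the XML comments are annotations to remove before pasting the text into the *Speech* window.

```ssml
<!-- Tip 9: a foreign word respelled for a Czech voice (assumed respelling, tune by ear) -->
Za nákup získáte <sub alias="kešbek">cashback</sub>.
<!-- Tip 10: special characters written out in words -->
The discount is 15 <sub alias="percent">%</sub>. Write to info <sub alias="at">@</sub> example <sub alias="dot">.</sub> com.
```

Writing the respelling directly into the text works as well. `<sub>` is simply one way to keep the source text readable for editors.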

***

## Intonation best practice

Master the art of customizing speech synthesis intonation with SSML on our [dedicated documentation page](#speech-synthesis-markup-language-ssml). Learn to personalize your AI voice with precision and creativity for a truly tailored and engaging auditory experience!

<details>

<summary>Here are some <strong>basic tips</strong></summary>

1. **Understand the context**: Intonation conveys meaning and emotion in speech. It's important to consider the context and intended message of the text. Identify the keywords, phrases, or sentences that require specific intonation patterns to convey the desired emphasis or emotion.
2. **Use punctuation**: Punctuation marks such as commas, periods, question marks, and exclamation marks indicate natural breaks and changes in intonation. Make sure to add appropriate punctuation to your text to guide the TTS system's intonation.
3. **Prosody tags**:  Utilize SSML tags to explicitly specify the desired intonation patterns for specific words or phrases.
4. **Experiment with pitch and duration**: Intonation involves variations in pitch and duration. Adjusting the pitch can create rising or falling intonation patterns while manipulating the duration of syllables or phrases can add emphasis or rhythmic patterns. Experiment with these parameters to achieve the desired intonation.
5. **Listen and iterate**: After applying intonation modifications, listen to the generated speech and evaluate the effectiveness of the intonation patterns. Make adjustments as needed to achieve the desired expressive quality and convey the intended meaning.
6. **Consult native speakers**: If possible, seek feedback from native speakers of the target language to ensure that the intonation sounds natural and appropriate. Native speakers can provide valuable insights and guidance on the intonation patterns specific to the language and context.

</details>

{% hint style="info" %}
Remember that achieving natural and expressive intonation in TTS systems can be challenging, as it requires capturing the nuances of human speech. It may take some **experimentation and refinement** to achieve the **desired results.**
{% endhint %}

***

### Intonation curve

Text-to-speech synthesis is also able to **automatically detect the sentence type and set the corresponding intonation curve**. This means that you can easily create a synthesized voice that sounds natural and matches the text you enter accurately.&#x20;

{% hint style="success" %}
**Pro-tip!** :sparkles: For neural voices provided by Microsoft Azure, feel free to use the [Audio content creator tool](https://speech.microsoft.com/audiocontentcreation) in Azure Speech Studio. Fine-tune synthesized speech audio to fit your scenario. Define lexicons and control speech parameters such as pronunciation, pitch, rate, pauses, and intonation.
{% endhint %}

{% tabs %}
{% tab title="Video example" %}

<figure><img src="https://4261467870-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6iQTvxgRZRwPS1NgIGEb%2Fuploads%2FfiI6Q1OC8YaGBL2alByj%2Fimage.png?alt=media&#x26;token=2a0f23fd-55fd-4b5d-927e-d13a0b59800f" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Azure intonation edit" %}

<figure><img src="https://4261467870-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6iQTvxgRZRwPS1NgIGEb%2Fuploads%2F9LYtN545cIcJCabeBLyT%2Fimage.png?alt=media&#x26;token=a6eac3e6-2045-4bb8-917d-49250e77f23f" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Step by step guide" %}

* In the dialogue window, you can see the **width of each word segment** (x-axis) on the intonation curve.
* Keep in mind that **longer words contain more vowels** and therefore more opportunities for the intonation curve to be modified.
* By default, the intonation curve is represented by a straight line at 0%. However, this **does not mean** that the **basic intonation is completely flat!** The tool allows you to adjust the pitch **relative** to the automatic basic synthesis.
* A maximum of **five intonation points** can be plotted on the curve for a single section.
  {% endtab %}
  {% endtabs %}

In Azure Audio Content Creator, you can adjust intonation **in sections,** which means you can set intonation for each **sub-section, phrase, or word separately**, instead of having to set intonation for the whole sentence. This allows you to capture different intonation nuances more accurately and gives you more control over how the neural voice will sound.

![](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-747493a5-3e89-4586-a550-8b73d5b99d03.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

**Accentuating intonation** allows listeners to distinguish between the different ideas and information contained in a sentence. By adjusting intonation curves, you can **highlight important points or emphasize changes in mood or emotion**, which will help the listener better understand and remember what the voice is saying.

<details>

<summary><mark style="color:green;"><strong>Intonation best practice tips</strong></mark></summary>

Here's how you can work with the Intonation curve effectively:

* **Understand the Intonation curve**: The Intonation curve represents the pitch contour of the speech waveform. It visualizes the changes in pitch over time. The x-axis represents time, and the y-axis represents pitch.
* **Identify key points:** Identify the key points in the text where you want to manipulate the intonation. These points could include emphasized words, important phrases, or sections that require specific intonation patterns.
* **Add anchor points:** Add anchor points on the Intonation curve to indicate the pitch changes. Click on the curve at specific time points to create anchor points. These points will serve as the reference for manipulating the pitch.
* **Adjust pitch and duration:** Drag the anchor points up or down to adjust the pitch at those time points. Moving them upward raises the pitch while moving them downward lowers the pitch. You can also drag the edges of the anchor points to modify the duration of the pitch change.
* **Create prosody patterns:** By adding multiple anchor points and adjusting their positions, you can create inflection patterns such as rising, falling, or fluctuating intonation. For example, a rising intonation pattern can indicate a question, while a falling intonation pattern can denote a statement or completion.
* **Preview and refine:** Preview the speech with the modified Intonation curve to evaluate the impact of your changes. Fine-tune the positions of anchor points as needed to achieve the desired intonation patterns.
* **Iterate and experiment:** Intonation patterns can be subjective and depend on the context and language. Experiment with different anchor point positions, shapes, and durations to find the most appropriate intonation for your specific text.

</details>

{% hint style="info" %}
When adjusting intonation curves, you should consider several factors such as sentence length, rhythm, accent, and the emphasis you want to convey. It is important to remember that intonation is a complex subject and that you need to practice and try different approaches.
{% endhint %}

***

### Melody patterns

Here are some examples of intonation patterns you can achieve:

{% tabs %}
{% tab title="Rising intonation" %}
To create a rising intonation pattern, you would place anchor points at the beginning of a phrase or sentence and gradually raise the pitch as you move forward in time. This pattern is commonly associated with questions or uncertainty.
{% endtab %}

{% tab title="Falling intonation" %}
A falling intonation pattern involves placing anchor points at the beginning of a phrase or sentence and gradually lowering the pitch as you move forward in time. This pattern often signals a statement or completion.
{% endtab %}

{% tab title="Fluctuating intonation" %}
You can create a fluctuating intonation pattern by adding anchor points at various positions along the Intonation curve. This pattern involves both rising and falling pitch movements, conveying emphasis or highlighting important information.
{% endtab %}

{% tab title="Plateau intonation" %}
A plateau intonation pattern maintains a relatively steady pitch without significant changes. You would position anchor points at similar heights along the curve, resulting in a flat or level pitch. This pattern is often used for conveying neutral or declarative statements.
{% endtab %}
{% endtabs %}
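
If you write SSML by hand instead of dragging points in the tool, the patterns above can be approximated with the `contour` attribute of the SSML `<prosody>` element, which takes a list of (time position, pitch target) pairs. The sketch below is illustrative only: the helper names and the percentage values are assumptions, the five-point cap simply mirrors the limit described for the tool, and whether a particular neural voice honors `contour` should be verified by listening.

```python
from xml.sax.saxutils import escape

MAX_POINTS = 5  # assumption: reuse the five-point limit described for the tool


def contour(points):
    """Format (position %, relative pitch change %) pairs as an SSML contour value."""
    if not 1 <= len(points) <= MAX_POINTS:
        raise ValueError(f"use between 1 and {MAX_POINTS} intonation points")
    return " ".join(f"({position}%,{change:+d}%)" for position, change in points)


def with_contour(text, points):
    """Wrap text in a <prosody> element carrying the given pitch contour."""
    return f'<prosody contour="{contour(points)}">{escape(text)}</prosody>'


# Hypothetical starting values; tune them by ear for your voice.
RISING = [(0, 0), (50, 5), (100, 15)]      # question-like
FALLING = [(0, 10), (50, 0), (100, -10)]   # statement-like
PLATEAU = [(0, 0), (50, 0), (100, 0)]      # level pitch

print(with_contour("Jaká je adresa tvého trvalého bydliště?", RISING))
```

A fluctuating pattern is just a list whose pitch values go up and down, for example `[(0, 0), (30, 10), (60, -5), (100, 5)]`.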

**Want more detailed tips** :interrobang:

<details>

<summary><span data-gb-custom-inline data-tag="emoji" data-code="1f1e8-1f1ff">🇨🇿</span> CZ intonation patterns step-by-step</summary>

#### Basic patterns in Czech language

**Rising intonation for questions**: In Czech, rising intonation is commonly used to indicate questions. When using the Intonation curve, create a rising pattern by gradually increasing the pitch from the beginning to the end of the question phrase or sentence. To emphasize the end of the question, you can drop the curve slightly before you bring it up to its peak.\
![image.png](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-3576db06-e5f5-49a3-8976-f3c194c76ba3.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

**Falling intonation for statements**: Statements in Czech typically have falling intonation. To convey this, place anchor points at the beginning of the statement and gradually lower the pitch as you move forward in time. This falling pattern gives a sense of finality and completion to the sentence.\
![image.png](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-9e89fbe0-c938-40e7-95c0-37b1991307bf.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

**Distinctive pitch accents**: In Czech, pitch accents do not typically change the meaning of a word. Unlike some tonal languages where pitch variations can differentiate lexical meanings, Czech does not have a lexical tone. Instead, Czech is characterized by patterns of stress and intonation.\
![image.png](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-666ed0e4-44ee-4269-84de-b419a3d2a4ca.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

**Emphasizing keywords**: In Czech, emphasis is often placed on specific words to highlight their importance or to contrast them with other elements in the sentence. Use the Intonation curve to create a noticeable pitch increase on the emphasized word or phrase. This helps convey the intended emphasis and focus within the sentence.\
![image.png](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-e9f1dd2b-c31c-4937-b8d6-d39246ec2694.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

**Pay attention to sentence structure**: The word order and sentence structure in Czech can influence intonation patterns. For example, the initial position of a subject in a sentence may receive more prominent intonation. Be mindful of these structural cues and adjust the intonation accordingly.

<br>

#### Question intonation

If we want to improve intonation to make it clear that a sentence is a question, the first instinct should be to set a curve that rises toward the end of the sentence.

<img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-7c2b6fad-7b6d-42b9-b50e-7032bb05bbe6.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt="image.png" data-size="original">

There are also specific words that **indicate a question,** such as **interrogative pronouns** (what, who, where, etc.). If we **raise the intonation** on the curve at this point, we will **emphasize these words**. This can be useful if we want to elicit specific information from users.

***Proč** máš tak veliké zuby?* - Emphasizes that we are asking *for a reason.*

***Kolik** vlasů má pohádkový dědeček Vševěd?* - Emphasizes that we are asking *for a count*.

***Jak** se podle vás správně budí princezny?* - Emphasizes that we are asking *for a method*.

\
This will be **more effective for longer questions.** For very short questions or questions made up of very short words, there are not enough vowels available to allow the more complex intonation curves to get the melody right.

*Jaká je adresa tvého trvalého bydliště?* ✓✓✓\
*Řekneš mi, kde bydlíš?* ✓✓\
*Kde domov můj?* ✓\
*Jak se máš?* ✓\
*Jak je?* ✗

\
Using SSML, you can also transform the original statement into a question. The melody of the sentence will probably appear "flatter" compared to a classically written question with a question mark at the end, which is rising by default.

**How to correct excessive pitch for Czech neural voices**

Sometimes the neural voice intones the ends of questions correctly but unnaturally. Typically, the pitch shoots up sharply at the end of the sentence, and the voice sounds "textbook-ish", as if it were breaking, affected, or strangled.

<img src="https://68.media.tumblr.com/a74ebcd493197c60f20743a0d3058424/tumblr_nf65iqOirF1sfmnojo1_500.gif" alt="abc.gif" data-size="original">

**If the pitch of synthetic speech seems to rise too much at the end of questions,** there are a few things you can try.

* [x] &#x20;Check the text to synthesize, and if necessary, try adjusting the copywriting.

*Už máte sjednané nové pojištění?* → *Prosím řekněte mi, zda již máte sjednané nové pojištění.*

* [x] &#x20;Consider adjusting the pitch of the synthesized sentence using the Pitch feature. This allows you to move the pitch of the synthesized speech up or down. In this case, adjust the index downwards by a few percent (e.g. 1 → 0.98).

<img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-eb1456d4-ba72-4049-ab88-f5712d335846.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt="image.png" data-size="original">

* [x] &#x20;Another option is to adjust the intonation curve and reduce the exaggerated pitch at the end of the sentence. Set the intonation point a few percent lower than the default value.

<img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-3588f291-7d3a-46d5-94f4-bb394f7b001e.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt="image.png" data-size="original">
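
The overall pitch adjustment from the checklist above can also be expressed directly in SSML with a relative `pitch` value on the `<prosody>` element. The following is a minimal sketch, assuming that the tool's index change of 1 → 0.98 corresponds to a relative pitch of `-2%`. The function name and the 10 % safety cap are hypothetical choices made for this example.

```python
from xml.sax.saxutils import escape


def lower_pitch(text, percent=2):
    """Wrap text in <prosody> with the pitch lowered by `percent` percent."""
    if not 0 < percent <= 10:  # hypothetical cap: the guide speaks of "a few percent"
        raise ValueError("lower the pitch by a few percent only")
    return f'<prosody pitch="-{percent}%">{escape(text)}</prosody>'


print(lower_pitch("Už máte sjednané nové pojištění?"))
```

Start with a small value and listen to the result before lowering the pitch further.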

</details>

***

### Pauses and breaks best practice

#### Basics

The **\<break>** tag is used to create a pause of a given length in the text. This tag can be used, for example, to express a pause between words or sentences, which can help to improve the naturalness of the delivery.&#x20;

{% hint style="info" %}
The Azure Audio Content Creator tool offers both predefined break tags and the option to insert breaks of any length you choose.
{% endhint %}

{% tabs %}
{% tab title="Break tag" %}

The **\<break> tag in SSML** is used to insert a pause or a break in the speech output. It allows you to specify the duration and the strength of the pause. The \<break> tag can be used with the following attributes:

*<mark style="color:purple;">time:</mark>*\
Specifies the duration of the pause in seconds or milliseconds. For example, \<break time="500ms"/> inserts a pause of 500 milliseconds.

<figure><img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-c44ee74a-f216-43fa-8619-1b0dba23703f.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt=""><figcaption></figcaption></figure>

*<mark style="color:purple;">strength:</mark>* \
Specifies the strength or intensity of the pause. It can have values like "none", "x-weak", "weak", "medium", "strong", or "x-strong". The actual interpretation of these values may depend on the specific text-to-speech engine used.

<figure><img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-0cd49bd6-f2c3-47e9-9efa-f1a9080e6cf5.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt=""><figcaption></figcaption></figure>

#### Example

* Hello, <mark style="color:blue;">\<break</mark> <mark style="color:purple;">time="500ms"</mark><mark style="color:blue;">/></mark> how are you today?
* Hello, <mark style="color:blue;">\<break</mark> <mark style="color:purple;">strength="medium"</mark><mark style="color:blue;">/></mark> how are you today?
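
Because a misspelled attribute name or value is easy to overlook in hand-written SSML, it can help to generate the tag programmatically. The sketch below is one possible helper, not part of any SDK: the name `break_tag` is hypothetical, and it deliberately accepts only one of the two attributes at a time to keep the example simple.

```python
import re

STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def break_tag(time=None, strength=None):
    """Return an SSML <break/> tag with either a time or a strength attribute."""
    if (time is None) == (strength is None):
        raise ValueError("pass exactly one of time or strength")
    if time is not None:
        # Durations are written in seconds or milliseconds, e.g. "0.5s" or "500ms".
        if not re.fullmatch(r"\d+(\.\d+)?(ms|s)", time):
            raise ValueError(f"invalid duration: {time!r}")
        return f'<break time="{time}"/>'
    if strength not in STRENGTHS:
        raise ValueError(f"invalid strength: {strength!r}")
    return f'<break strength="{strength}"/>'


print(f"Hello, {break_tag(time='500ms')} how are you today?")
print(f"Hello, {break_tag(strength='medium')} how are you today?")
```

A misspelled value such as `strength="strenght"` raises a `ValueError` here instead of reaching the synthesizer.
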
  {% endtab %}

{% tab title="Pause - sentences" %}
A full stop (period) normally produces a pause of approximately **200-300 ms.** If you find these pauses too long, we recommend replacing the punctuation with a shorter break, ideally **100 ms, but never less than 50 ms.**

![Each period is replaced by a pause of 100 milliseconds.](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-a92f2dca-62ef-481e-9559-eec9c01cb30a.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

{% hint style="warning" %}
The punctuation must **be replaced** by an SSML break! Putting a break tag after the punctuation would paradoxically make the pauses even longer (default punctuation pause of up to 300 ms + the length of the additional break tag).
{% endhint %}

By setting **different pause lengths,** we can influence the **rhythm of speech**. For example, set a shorter pause between sentences that form a single thought than between separate thoughts. Listeners barely notice millisecond differences in pause length consciously. However, the **impression of the synthesis changes mainly through the change of rhythm between sections**. We can take advantage of this to separate individual pieces of information more clearly.

<figure><img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-2424abcd-9596-417f-9489-cce96eeabfbb.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt=""><figcaption></figcaption></figure>

<figure><img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-e21e0f02-3d49-406a-ae24-fc94cb9aa7f6.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt=""><figcaption></figcaption></figure>

In the example shown in the figure above, the text can be divided into the sections *<mark style="color:yellow;">acknowledgement</mark> → <mark style="color:green;">argumentation</mark> → <mark style="color:blue;">call to action</mark> → <mark style="color:red;">confirmation question</mark>.* The argumentation consists of several sentences that form a unified train of thought, so it contains shorter pauses. In contrast, a **100ms** gap between sections is **more distinctive.**
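
The replacement of full stops by breaks of two different lengths can be sketched as a small helper. This is only an illustration under stated assumptions: the function names, the 50 ms pause inside a section, and the English sample sentences are hypothetical. The 100 ms gap between sections and the 50 ms lower bound come from the guidance in this tab.

```python
def ms_break(ms):
    """Return an SSML break tag of the given length in milliseconds."""
    return f'<break time="{ms}ms"/>'


def join_with_breaks(sections, within_ms=50, between_ms=100):
    """Join sections (lists of sentences), replacing full stops with breaks.

    The full stop is removed, not kept, so the pause does not get longer.
    The punctuation of the very last sentence is left untouched.
    """
    if min(within_ms, between_ms) < 50:
        raise ValueError("breaks shorter than 50 ms are not recommended")
    parts = []
    last_section = len(sections) - 1
    for i, sentences in enumerate(sections):
        for j, sentence in enumerate(sentences):
            closes_section = j == len(sentences) - 1
            if i == last_section and closes_section:
                parts.append(sentence.strip())
                break
            pause = between_ms if closes_section else within_ms
            parts.append(sentence.strip().rstrip(".") + " " + ms_break(pause))
    return " ".join(parts)


message = [
    ["Thank you."],                                                     # acknowledgement
    ["The offer is valid until Friday.", "It includes free delivery."],  # argumentation
    ["Shall I book it for you?"],                                       # confirmation question
]
print(join_with_breaks(message))
```

Sentences inside one section are separated by the shorter break, while the longer break marks the boundary between sections.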
{% endtab %}

{% tab title="Pause - words" %}

The **break also affects the intonation** of the sentence and acts like a full stop, i.e. the intonation curve drops at the end.

Sometimes, however, we need to adjust a pause between words or parts of a sentence without the intonation of a full stop. A **softer drop** in voice combined with a pause is produced by other punctuation marks, the most prominent of which is the **comma**; slightly more subtle are the **colon**, **semicolon** or **dash**.

> > **Prominent pauses:**\
> > Něco končí \<break strength = "weak"/> Něco začíná.\
> > Něco končí \<break time = "100ms" /> Něco začíná.\
> > Něco končí. Něco začíná.

> > **Softer pauses:**\
> > Něco končí; něco začíná.\
> > Něco končí, něco začíná.\
> > Něco končí: něco začíná.\
> > Něco končí- něco začíná.

We recommend adding these marks at the point where the voice drop and short pause should occur, or using them to replace the original punctuation. If the pause is too short, write several of these marks in a row; their pause lengths add up.

> > Něco končí,,, něco začíná.\
> > Něco končí;; něco začíná.

#### Example

![](https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-46094c6c-2213-4d87-959b-7174f730d23f.png\&download=false\&resolveLfs=true&%24format=octetStream\&api-version=5.0-preview.1\&sanitize=true\&versionDescriptor.version=wikiMaster)

Note that we have applied **multiple strategies** to improve the quality of the synthesis in the example.

1. **Rewriting** the graphical labels of the options as words (**A. ---> option A**);
2. **Replacing punctuation** marks with a **break tag** of a chosen length;
3. **Combining additional punctuation marks** between words for finer and shorter pauses with less effect on intonation.
   {% endtab %}

{% tab title="Pauses - after question" %}
The **question mark** serves as a punctuation mark that not only indicates a question but **also functions as a pause.** When a question is posed, there is typically a brief interruption of **around 300 milliseconds.**

However, if the question is followed by additional text, this pause can disrupt the flow of speech and give the impression that the voicebot has finished speaking. To prevent the voicebot and the user from overlapping in their speech, it is important to consider certain strategies for phrasing questions and placing them appropriately within the speech. The chapter on copywriting delves into these strategies in more detail.

Let's review some fundamental principles of good practice:

* **Each message** should contain only **one question.**
* **Each question** should focus on a **single topic**. For example, instead of asking, "Would you like to create a new account and join our Premium Club?" it is better to ask separate questions for each option.
* It is advisable to include the **question at the end of the message** so that the user has already received all the necessary information.
  {% endtab %}

{% tab title="A/B question" %}
Let's address one specific case of choice questions (e.g., "Do you want A or do you want B?").&#x20;

In this scenario, it is crucial to ensure good synthesis and **emphasize distinctive intonation to convey the question's intent clearly and elicit the expected response.** One recommendation is to structure the sentence in a way that presents each option as a separate sentence.

To assess the quality of synthesis, intonation, and the initial impression the question makes, **experiment with different writing styles and punctuation**. It is possible that breaks or punctuation marks may not provide a sufficiently distinct separation between the two options. In such cases, trying out a question mark after each option could be worth considering.

<br>

<figure><img src="https://dev.azure.com/borndigitalai/578a80a1-7788-4722-bd5f-183c0f413723/_apis/git/repositories/3d058504-fd4a-4522-8a49-88ccfc042356/Items?path=/.attachments/image-cfaad430-b16b-40bd-85ad-b577f63f9bff.png&#x26;download=false&#x26;resolveLfs=true&#x26;%24format=octetStream&#x26;api-version=5.0-preview.1&#x26;sanitize=true&#x26;versionDescriptor.version=wikiMaster" alt=""><figcaption></figcaption></figure>

Please note that this is not a violation of the rule stating that only one question should be included in each text. It's important to **remember that speech synthesis input notation serves a different purpose** than regular text and does not necessarily adhere to grammar or spelling conventions. In this instance, it is considered a single question, with each option represented by a distinct intonation pattern that visually resembles a question mark.

In the case of *"Do you want coffee? Or do you want tea?"* a pause between the two options might **be excessively long**. Unfortunately, this gap cannot be shortened using the break tag. One possible approach is to remove the space after the first question mark *(e.g., "Do you want coffee?Or do you want tea?").* In certain cases, **combining the options into one continuous string can be helpful**, with the question mark indicating the intonation but without the unwanted extended pause.

However, **note** that this is just **one of the potential solutions and not the only correct way** to adjust the intonation in such cases.
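
The space-removal approach described above can be sketched as a tiny helper that gives each option its own question mark and joins the options without a space. The function name is hypothetical, and, as noted, this is only one of several possible notations to try.

```python
def choice_question(option_a, option_b):
    """Join two options so that each ends with '?' and no space separates them."""
    first = option_a.strip().rstrip("?.") + "?"
    second = option_b.strip().rstrip("?.") + "?"
    return first + second


print(choice_question("Do you want coffee?", "Or do you want tea?"))
```

Always listen to the synthesized result, because the effect of the missing space can differ between voices and languages.
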
{% endtab %}
{% endtabs %}

***

## Pronunciation best practice

### Phoneme-defined pronunciation

{% tabs %}
{% tab title="Stress and accent" %}
**Stress** can have a huge impact on pronunciation.&#x20;

When a word is stressed, the stressed syllable is typically pronounced with greater intensity, higher pitch, and longer duration compared to unstressed syllables. Additionally, **the quality of vowels in stressed syllables may also be affected, with stressed vowels often being pronounced with more clarity and fullness.**

In :flag\_cz: Czech, **stress is generally fixed on the first syllable** of a word. This means that the first syllable receives the primary stress, while the subsequent syllables have secondary stress or are unstressed. However, it's important to note that stress patterns can vary depending on the word and its inflectional or derivational forms.

To provide some examples in IPA (International Phonetic Alphabet), let's consider a few Czech words:

> > "kniha" (book): `/ˈkɲɪɦa/`\
> > The primary stress falls on the first syllable (/kɲɪ/), making it more prominent in pronunciation.

> > "univerzita" (university): `/ˈunɪvɛrzɪta/`\
> > The primary stress is on the first syllable (/u/). The remaining syllables are unstressed or carry at most a weak secondary stress.

> > "překvapení" (surprise): `/ˈpr̝̊ɛkvapɛɲiː/`\
> > The primary stress is on the first syllable (/pr̝̊ɛ/). The final vowel is long, but vowel length in Czech is independent of stress.

These examples illustrate the general stress patterns in Czech, where the stressed syllables are emphasized in terms of intensity, pitch, duration, and sometimes vowel quality. It's important to consult native speakers or audio resources to further refine your pronunciation and understand the intricacies of Czech stress patterns.
{% endtab %}

{% tab title="Word borrowings" %}
When incorporating **English** words into the :flag\_cz: Czech language, the stress patterns of these borrowed words tend to follow the **stress patterns of Czech words**. However, there may be some adjustments to fit the Czech stress rules. Here are a few guidelines for handling English words in Czech:

* **First-syllable stress:** As mentioned earlier, Czech generally has stress on the first syllable of a word. When adopting English words, the primary stress is often placed on the first syllable to align with Czech stress patterns.

> > Example: "computer" in English has stress on the second syllable (`/kəmˈpjuːtər/`), but in Czech, it would typically be pronounced with stress on the first syllable: `/ˈkɔmpjutr/`.

* **Adaptation of vowel sounds:** English vowels can have different qualities compared to Czech vowels. When adapting English words into Czech, the vowel sounds may be modified to match the Czech vowel inventory.

> > Example: "restaurant" in English (`/ˈrɛstərɒnt/`) could be adapted in Czech as `/ˈrɛstaurant/`.

* **Retaining original stress:** In some cases, English words may retain their original stress patterns, particularly when they are relatively recent borrowings or specialized terms that have become familiar to Czech speakers.

> > Example: "hotel" in English has stress on the second syllable (`/hoʊˈtɛl/`). A speaker who keeps the original pronunciation preserves this stress, whereas the fully adapted Czech form has stress on the first syllable: `/ˈhotɛl/`.

&#x20;:woman\_teacher: Adaptation of English words in Czech can vary depending on individual preferences, language register, and the familiarity of the borrowed word to Czech speakers. Therefore, there can be some variability in how English words are pronounced within the Czech language context.
{% endtab %}
{% endtabs %}

### Foreign words pronunciation

When using Azure's Speech Studio with neural voices to handle foreign word pronunciation, here are some tips to ensure accurate pronunciation:

* **Phonetic spelling**: Provide a phonetic spelling of the foreign words using the International Phonetic Alphabet (IPA) or a transcription system that the neural voice of the base language supports. This helps the TTS system understand the correct pronunciation of the word.
* **Lexicon customization**: Utilize the lexicon customization feature in Azure's Speech Studio to add pronunciation rules for specific foreign words. This allows you to specify the pronunciation of each word or phrase more precisely.
* **Pronunciation rules:** Create pronunciation rules for common patterns found in foreign words. For example, if there is a consistent pattern of stress in the foreign language, you can define rules to apply stress in the appropriate position.
* **Contextual cues**: Provide additional context within the text to help guide the TTS system's pronunciation. This could include nearby words or phrases that assist in determining the correct pronunciation of the foreign word.
* **Test and iterate**: After applying the above techniques, listen to the generated speech and identify any mispronunciations. Adjust the phonetic spellings, lexicon entries, or pronunciation rules as necessary and continue testing until the desired pronunciation is achieved.

It's important to note that while these tips can improve the accuracy of foreign word pronunciation in TTS systems, the results may still vary. TTS systems are trained on large datasets and generalize pronunciation based on the language's phonetic patterns. Handling foreign words can be challenging due to the diverse pronunciation rules across languages.
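
In hand-written SSML, the phonetic-spelling tip usually maps to the `<phoneme>` element and a simple respelling maps to the `<sub>` element. The sketch below shows both under stated assumptions: the helper names are hypothetical, the IPA string `baːn` and the respelling `ajpí` are illustrative values rather than tested settings, and which phonetic symbols a given voice accepts has to be checked against the provider's documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Attach an IPA transcription to a word with the SSML <phoneme> element."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def alias(written, spoken):
    """Replace what is written with a respelling, using the SSML <sub> element."""
    return f'<sub alias={quoteattr(spoken)}>{escape(written)}</sub>'


# Illustrative values only; adjust them after listening to the output.
print(f"Ceny mají jako Deutsche {phoneme('Bahn', 'baːn')}.")
print(f"Potřebuji znát vaši {alias('IP', 'ajpí')} adresu.")
```

A respelling with `<sub>` is often the quicker fix, while `<phoneme>` gives finer control when the voice supports the required symbols.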

{% tabs %}
{% tab title="CZ - Foreign words" %}
:flag\_cz: Examples:

***

Some phrases or individual words from foreign vocabulary are already covered by the default text-to-speech model, and their synthesis is smooth, localized to the base language, and pleasant to the ear. These sound fine without adjustments:

* *Nejpoužívanější vyhledávač v Česku je Seznam, nikoliv **Google.***
* *Letíme na dovolenou se společností **Lufthansa.***
* *Koupím ojetý **renault.***

Other phrases or words come out nearly correct and comprehensible but need a few tweaks. Precisely because their pronunciation is almost right, they can cause an uncanny valley effect:

* *Ceny maji jako **Deutsche bahn**, ale služby jako nejposlednější drožka.*

> > **deutsche** is pronounced correctly like `[dɔ͡ɪˈt.ʃɛ.]`, but **bahn** sounds like **`[ba.ɦaːˈ.ẽˈ]`** and needs to be adjusted

* *Potřebuji znát vaši **IP** adresu.*

> > **IP** is pronounced like `[iːˈ.pɛː]`, which is comprehensible, but in this case it would be better to adjust the pronunciation to the Czech rendering of the English letters: `[a͡j.piː]`

{% endtab %}

{% tab title="CZ - Currency, prices, money sums pronunciation" %}

🇨🇿 The pronunciation of numbers followed by currency signs differs considerably from the common reading rules. With prices, sums of money, or values, we often omit the words describing the decimal order of the fractional part (`[desetiny, setiny]`):

> > 19,99 - `[devatenáct celých, devadesát devět setin]`\
> > 19,99 Kč - `[devatenáct korun devadesát devět haléřů]`\
> > v akci jen za 19,99 Kč - `[v akci jen za devatenáct devadesát devět]`\
> > rohlík stojí 1,50,- - `[rohlík stojí korunu padesát]`

Azure's speech synthesis recognizes some words and signs for currencies:

> > **A.** Czech currency\
> > 100 Kč `[sto českých korun]`\
> > 1000 CZK `[tisíc českých korun]`

> > **B.** Dollars\
> > 10 $ `[10 dolarů]`\
> > 3 $ `[3 dolary]`

> > **C.** Euros\
> > 3 € `[3 eura]`\
> > 1000 € `[tisíc eur]`\
> > 1 € `[jedno euro]`

> > **D.** British pounds\
> > 1 £ `[jedna libra]`\
> > 3 £ `[tři libry]`\
> > 10 £ `[deset liber]`

> > **E.** Not all international currency symbols are supported. Apart from the few most common ones, they usually are not.

{% endtab %}

{% tab title="CZ  - Digits and numbers pronunciation" %}

Pronunciation of digits depends on:

* length of a digit string
* input format
* language and neural persona
* SSML rules
* reading rules and their localization
* numeric type (integer, float, ordinal)
  {% endtab %}
  {% endtabs %}

### **Integers**

{% tabs %}
{% tab title="CZ " %}
:flag\_cz: Czech language

* If a number has **6 digits or fewer**, it is always **read decadically by default**, regardless of whether a space separates orders of thousands\
  \
  1234\
  1 234\
  `Both numbers are pronounced as: TISÍC DVĚ STĚ TŘICET ČTYŘI`\
  12345\
  12 345\
  `Both numbers are pronounced as: DVANÁCT TISÍC TŘI STA ČTYŘICET PĚT`\
  123456\
  123 456\
  `Both numbers are pronounced as: STO DVACET TŘI TISÍC ČTYŘI STA PADESÁT ŠEST`<br>

* If digits have **7 or more characters**, they are read **decadically** only if the order of thousands is **separated by a space.** Numeric string notation **without spaces defaults to reading each digit in turn.**\
  \
  1234567 `is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM`\
  1 234 567 `is pronounced MILION DVĚ STĚ TŘICET ČTYŘI TISÍC PĚT SET ŠEDESÁT SEDM`\
  123456789 `is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM OSM DEVĚT`\
  123 456 789 `is pronounced STO DVACET TŘI MILIONŮ ČTYŘI STA PADESÁT ŠEST TISÍC SEDM SET OSMDESÁT DEVĚT`

* If we need **shorter numbers to be read each digit in turn**, there are several ways to do it\
  **A.** Add spaces in between\
  1 2 3 4 5 6 `is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST`\
  \
  **B.** Add commas in between\
  1, 2, 3, 4, 5, 6 `is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST with more distinct pauses between each of them`\
  \
  **C.** Use the SSML say-as tag for spelling (hláskování)\
  \<say-as interpret-as="spell" format="undefined">123456\</say-as> `is pronounced JEDEN DVA TŘI ČTYŘI PĚT ŠEST`

* If we need **longer numbers to be read each digit in turn**, copywriting input needs to be adapted accordingly\
  \
  **A.** Use numeric string without spaces\
  1234567890\
  \
  **B.** Separate each digit with spaces\
  1 2 3 4 5 6 7 8 9\
  \
  **C.** Separate each digit with punctuation\
  1,2,3,4,5,6,7,8,9\
  \
  These numbers will be pronounced as `JEDEN DVA TŘI ČTYŘI PĚT ŠEST SEDM OSM DEVĚT`
  {% endtab %}

{% tab title="SK" %}
:flag\_sk: Slovak language

* If a number has **6 digits or fewer**, it is always **read decadically by default**, regardless of whether a space separates orders of thousands.

> > 123456\
> > 123 456\
> > `Both numbers are pronounced as: stodvadsaťtri tisíc štyristo päťdesiatšesť`

* If digits have **7 or more characters**, they are read **decadically** only if the order of thousands is **separated by a space.** Numeric string notation **without spaces defaults to reading each digit in turn.**

> > 1234567 `is pronounced [jedna dva tri štyri päť šesť sedem]`\
> > 1 234 567 `is pronounced [jeden milión dvestotridsaťštyri tisíc päťsto šesťdesiatsedem]`

* **The maximum length of a spaced numerical string that is pronounced decadically is 15 digits** (10^14), which means hundreds of trillions in the EN system, in SK **stovky biliónov**.

> > 111 222 333 444 555\
> > `is pronounced [sto jedenásť biliónov ...]`

* Spaced numerical strings **longer than 15 digits are pronounced as multiple numbers,** each containing 15 digits or less

> > 111 222 333 444 555 666 will be divided as 111 222 333 444 555 | **666**\
> > `[sto jedenásť biliónov ... päťstopäťdesiatpäť | šesťstošesťdesiatšesť]`

* If we need **shorter numbers to be read each digit in turn**, there are several ways to do it

> > Add spaces in between\
> > 1 2 3 4 5 6 `is pronounced jedna dva tri štyri päť šesť`

> > Add commas in between\
> > 1, 2, 3, 4, 5, 6 `is pronounced jedna dva tri štyri päť šesť with more distinct pauses between each of them`

> > Use the SSML say-as tag for spelling (hláskovanie)\
> > \<say-as interpret-as="spell" format="undefined">123456\</say-as> `is pronounced jedna dva tri štyri päť šesť`

* If we need **longer numbers to be read each digit in turn**, copywriting input needs to be adapted accordingly

> > Use numeric string without spaces\
> > 1234567890

> > Separate each digit with spaces\
> > 1 2 3 4 5 6 7 8 9

> > Separate each digit with punctuation\
> > 1,2,3,4,5,6,7,8,9

> These numbers will be pronounced as `jedna dva tri štyri päť šesť sedem osem deväť`

<br>
{% endtab %}

{% tab title="EN" %}
:flag\_gb: English language

* If a number has **6 digits or fewer**, it is always **read decadically by default**, regardless of whether orders of thousands are separated by a space.

> > 1234\
> > 1 234\
> > `Both numbers are pronounced as: [one thousand two hundred thirty-four]`\
> > 12345\
> > 12 345\
> > `Both numbers are pronounced as: [twelve thousand three hundred forty-five]`\
> > 123456\
> > 123 456\
> > `Both numbers are pronounced as: [one hundred twenty-three thousand four hundred fifty-six]`

* If digits have **7 or more characters**, they are read **decadically** only if the order of thousands is **separated by a space.** Numeric string notation **without spaces defaults to reading each digit in turn.**

> > 1234567 `is pronounced [one two three four five six seven]`\
> > 1 234 567 `is pronounced [one million two hundred thirty-four thousand five hundred sixty-seven]`

* **The maximum length of spaced numerical string which is pronounced decadically is 15 digits** (10^14), in EN system for hundreds of trillions.

> > 111 222 333 444 555\
> > `is pronounced [one hundred and eleven trillion ...]`

* Spaced numerical strings **longer than 15 digits are pronounced as multiple numbers,** each containing 15 digits or less

> > 111 222 333 444 555 666 will be divided as 111 222 333 444 555 | **666**\
> > `[one hundred and eleven trillion ... five hundred fifty five | six hundred sixty six]`

* If we need **shorter numbers to be read each digit in turn**, there are several ways to do it

> > Add spaces in between\
> > 1 2 3 4 5 6 `is pronounced [one two three four five six]`

> > Add commas in between\
> > 1, 2, 3, 4, 5, 6 `is pronounced one two three four five six with more distinct pauses between each of them`

> > Use the SSML say-as tag for spelling\
> > \<say-as interpret-as="spell" format="undefined">123456\</say-as> `is pronounced one two three four five six`

* If we need **longer numbers to be read each digit in turn**, copywriting input needs to be adapted accordingly

> > Use numeric string without spaces\
> > 1234567890\
> > `[one two three four five six seven eight nine zero]`

> > If all trailing triplets consist of 000, the string is read decadically even without spaces
> >
> > * 1000000 - `[one million]`
> > * 11000000 - `[eleven million]`
> > * 111000000 - `[one hundred eleven million]`
> > * 2000000000 - `[two billion]`
> > * 123000000000000 - `[one hundred twenty-three trillion]`

> > But when zeros cannot be divided into triplets, the numerical string is pronounced one digit at a time
> >
> > * 123100000000000 - `[one two three one zero zero zero ... ]`

* If we need **longer numbers to be read decadically**, separate orders of thousands, millions, billions, and trillions with spaces

> > - 1 234 567 - `[one million two hundred thirty-four thousand five hundred sixty-seven]`
> > - 123 000 000 000 001 - `[one hundred twenty-three trillion and one]`

{% endtab %}

{% tab title="PL" %}
:flag\_pl: Polish language<br>

* If a number has **exactly 4 digits:**

> > **A.** Round thousands such as 1000, 2000, 3000, etc. are read by Azure neural voices as:\
> > 1000 - `tysięczny rok` – year one thousand\
> > 2000 - `dwutysięczny rok` – year two thousand\
> > 3000 - `trzytysięczny rok` - year three thousand etc. - This is **WRONG.**\
> > \
> > It should be pronounced as:\
> > 1000 - `tysiąc`\
> > 2000 - `dwa tysiące`\
> > 3000 - `trzy tysiące`

This situation can also occur with other numbers, for example 4999 is pronounced as\
`cztery tysiące dziewięćset dziewięćdziesiąty dziewiąty rok` - this is **WRONG**.\
**CORRECT** is `cztery tysiące dziewięćset dziewięćdziesiąt dziewięć`.\
Every number needs to be checked!

> > **B.**

| **NUMBER** | **OK/NOK** |
| ---------- | ---------- |
| 1234       | OK         |
| 2234       | OK         |
| 3234       | OK         |
| 4234       | OK         |
| 5234       | NOK        |
| 6234       | NOK        |
| 7234       | NOK        |
| 8234       | NOK        |
| 9234       | NOK        |

1..., 2..., 3..., 4... (thousand) are pronounced **CORRECT**.

5..., 6..., 7..., 8..., 9... (thousand) are pronounced **WRONG**.

5324 – `pięć TYSIĄCE` - correct is `pięć TYSIĘCY`

6234 - `sześć TYSIĄCE` - correct is `sześć TYSIĘCY`

7324 – `siedem TYSIĄCY` - correct is `siedem TYSIĘCY`

8324 - `osiem TYSIĄCE` - correct is `osiem TYSIĘCY`

9324 - `dziewięć TYSIĄCE` - correct is `dziewięć TYSIĘCY`
{% endtab %}
{% endtabs %}
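
The spacing rules above can be applied automatically before the text reaches the *Speech* window. The following is a minimal Python sketch, not a platform feature: the helper names `group_thousands` and `digit_by_digit` are hypothetical, and the resulting pronunciation still depends on the language and voice as described in the tabs above.

```python
def _clean(digits: str) -> str:
    """Strip spaces and make sure only digits remain."""
    cleaned = digits.replace(" ", "")
    if not cleaned.isdigit():
        raise ValueError(f"expected digits only, got {digits!r}")
    return cleaned


def group_thousands(digits: str) -> str:
    """Insert a space between groups of three digits, counted from the right.

    Intended for numbers that should be read as one whole number.
    """
    rest = _clean(digits)
    groups = []
    while rest:
        groups.append(rest[-3:])
        rest = rest[:-3]
    return " ".join(reversed(groups))


def digit_by_digit(digits: str, pause: bool = False) -> str:
    """Separate every digit so it is read on its own.

    With pause=True the digits are separated by commas, which the tabs
    above describe as producing more distinct pauses.
    """
    separator = ", " if pause else " "
    return separator.join(_clean(digits))


print(group_thousands("1234567"))            # input for a whole-number reading
print(digit_by_digit("123456"))              # input for a digit-by-digit reading
print(digit_by_digit("123456", pause=True))  # same, with pauses
```

Always listen to the synthesized result, because the thresholds described above (6 digits, 15 digits) were observed for specific voices and may differ elsewhere.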

### Phone numbers

{% tabs %}
{% tab title="CZ" %}
:flag\_cz: Czech language

For the pronunciation of **telephone numbers:**

> > **A.** Write them down in international format including the prefix\
> > Volejte +420 800 148 148\
> > `is pronounced VOLEJTE PLUS ČTYŘI STA DVACET, OSM SET, STO ČTYŘICET OSM, STO ČTYŘICET OSM`

> > **B.** Separate custom sections with punctuation\
> > Volejte 800, 148, 148\
> > `is pronounced OSM SET, STO ČTYŘICET OSM, STO ČTYŘICET OSM`\
> > Volejte 212-456-789\
> > `is pronounced DVĚ STĚ DVANÁCT, ČTYŘI STA PADESÁT ŠEST, SEDM SET OSMDESÁT DEVĚT`\
> > Volejte 800, 54, 12, 12\
> > `is pronounced OSM SET, PADESÁT ČTYŘI, DVANÁCT, DVANÁCT`\
> > Volejte 800, 12, 7, 7, 7, 7\
> > `is pronounced OSM SET, DVANÁCT, SEDM, SEDM, SEDM, SEDM`

> > **C.** Write them down as alphabetical string\
> > Volejte osm set dvanáct čtyři sedmničky\
> > Volejte osm set dvanáct sedm sedm sedm sedm

> > **D.** Separate sections with spaces, as long as the number is meant to be pronounced in a 3-2-2-2 or 3-2-1-1-1-1 pattern\
> > 800 54 12 12\
> > `is pronounced OSM SET, PADESÁT ČTYŘI, DVANÁCT, DVANÁCT`\
> > 800 54 1 2 1 2\
> > `is pronounced OSM SET, PADESÁT ČTYŘI, JEDNA, DVA, JEDNA, DVA`

<br>
{% endtab %}

{% tab title="SK" %}
:flag\_sk: Slovak language

For the pronunciation of **telephone numbers:**

> > **A.** Write them down in international format including the prefix\
> > Volajte +421 800 148 148\
> > `is pronounced VOLAJTE PLUS ŠTYRI DVA JEDEN, OSEMSTO, STOŠTYRIDSAŤOSEM, STOŠTYRIDSAŤOSEM`

> > **B.** Separate custom sections with punctuation\
> > Volajte 800, 148, 148\
> > `is pronounced OSEMSTO, STOŠTYRIDSAŤOSEM, STOŠTYRIDSAŤOSEM`\
> > Volajte 800-148-148\
> > `is pronounced OSEMSTO, STOŠTYRIDSAŤOSEM, STOŠTYRIDSAŤOSEM`\
> > Volajte 800, 54, 12, 12\
> > `is pronounced OSEMSTO, PÄŤDESIATŠTYRI, DVANÁSŤ, DVANÁSŤ`\
> > Volajte 800, 12, 7, 7, 7, 7\
> > `is pronounced OSEMSTO, DVANÁSŤ, SEDEM, SEDEM, SEDEM, SEDEM`

> > **C.** Write them down as alphabetical string\
> > Volajte na osemsto dvanásť štyri siedmičky\
> > Volajte na osemsto dvanásť sedem sedem sedem sedem

<br>
{% endtab %}

{% tab title="EN" %}
:flag\_gb: English language

For the pronunciation of **telephone numbers:**

> > **A.** Write them down in international format including the prefix\
> > Call +1-212-456-7890 (USA)\
> > `is pronounced [call plus one - two one two - four five six - seven eight nine zero]`\
> > Call +44 7911 123456 (UK)\
> > `is pronounced [call plus four four, seven nine one one, one two three four five six]`

> > **B.** For the USA, follow the domestic 3-3-4 format\
> > Call 212-456-7890\
> > `is pronounced [call two one two - four five six - seven eight nine zero]`

> > **C.** For the UK, follow area code formatting (3-4 digits + 7-8 digits)\
> > 020 (London) 1234 5678\
> > `is pronounced [zero two zero, one two three four, five six seven eight]`

> > **D.** To customize pronunciation, separate digits with spaces or punctuation\
> > Call 4 4 4 4, 1 2 3, 1 2 3\
> > `is pronounced [call four four four four, one two three, one two three] with more distinctive pauses between comma-separated sections`\
> > Call 800, 12, 34, 12, 34\
> > `is pronounced [call eight hundred, twelve, thirty-four, twelve, thirty-four]`\
> > Call 800 12 34 12 34\
> > `is pronounced [call eight hundred one two three four one two thirty-four]`

> > **E.** For a 10-digit phone number (prefix not included), you can use the SSML say-as tag with the telephone attribute\
> > \<say-as interpret-as="telephone" format="undefined">1234567890\</say-as>\
> > `[one two three four five six seven eight nine o]`

<br>
{% endtab %}
{% endtabs %}
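
The comma-separated grouping shown in the tabs above can be generated from a raw number. This is a minimal Python sketch under assumptions: `chunk_phone` is a hypothetical helper, it keeps digits only (so a leading `+` is dropped and prefix digits must be covered by the pattern), and the grouping pattern is something you choose per country.

```python
def chunk_phone(number: str, pattern=(3, 3, 3), sep: str = ", ") -> str:
    """Split the digits of `number` into groups of the given sizes."""
    # Keep digits only; spaces, hyphens and "+" are ignored.
    digits = "".join(ch for ch in number if ch.isdigit())
    if sum(pattern) != len(digits):
        raise ValueError(
            f"pattern {pattern} covers {sum(pattern)} digits, "
            f"but {number!r} has {len(digits)}"
        )
    chunks, pos = [], 0
    for size in pattern:
        chunks.append(digits[pos:pos + size])
        pos += size
    return sep.join(chunks)


print("Volejte " + chunk_phone("800148148"))                   # 3-3-3 grouping
print("Volejte " + chunk_phone("800 54 12 12", (3, 2, 2, 2)))  # 3-2-2-2 grouping
```

Raising an error on a length mismatch is a deliberate choice here: a silently truncated phone number is worse than a failed message.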

### Long numerical strings (IDs, codes)

{% tabs %}
{% tab title="CZ" %}

#### :flag\_cz: Czech language

For pronunciation of long numerical strings (order IDs etc.):

> > **A.** Write numerical strings without spaces to be read one digit in turn\
> > Objednávka 01304578931\
> > `Objednávka NULA JEDNA TŘI NULA ČTYŘI PĚT SEDM OSM DEVĚT TŘI JEDNA`

> > **B.** Customize pronunciation by breaking it into smaller chunks with punctuation\
> > Objednávka 01, 30, 45, 78, 9-3-1\
> > `Objednávka NULA JEDNA, TŘICET, ČTYŘICET PĚT, SEDMDESÁT OSM, DEVĚT TŘI JEDNA`

<br>
{% endtab %}

{% tab title="SK" %}
:flag\_sk: Slovak language

For pronunciation of long numerical strings (order IDs etc.):

> > **A.** Write numerical strings without spaces to be read one digit in turn\
> > Objednávka 01304578931\
> > `Objednávka NULA JEDEN TRI NULA ŠTYRI PÄŤ SEDEM OSEM DEVÄŤ TRI JEDEN`

> > **B.** Customize pronunciation by breaking it into smaller chunks with punctuation\
> > Objednávka 01, 30, 45, 78, 9-3-1\
> > `Objednávka NULA JEDEN, TRIDSAŤ, ŠTYRIDSAŤPÄŤ, SEDEMDESIATOSEM, DEVÄŤ TRI JEDEN`
> > {% endtab %}

{% tab title="EN" %}
:flag\_gb: English language

For pronunciation of long numerical strings (order IDs etc.):

> > **A.** Write numerical strings without spaces to be read one digit in turn\
> > Order 01304578931\
> > `[o one three o four five seven eight nine three one]`

> > **B.** Customize pronunciation by breaking it into smaller chunks with punctuation\
> > Order 01, 30, 45, 78, 9-3-1\
> > `[zero one, thirty, forty-five, seventy-eight, nine three one]`

<br>
{% endtab %}
{% endtabs %}
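
The chunked notation from option **B** can be produced from a raw ID. The sketch below is only an illustration: `chunk_id` is a hypothetical helper, and the choice of four two-digit groups followed by hyphen-separated single digits simply mirrors the example above.

```python
def chunk_id(code: str, pairs: int = 4) -> str:
    """Format `code` as up to `pairs` two-digit groups, then single digits.

    Two-digit groups are separated by commas; any remaining digits are
    joined with hyphens so that they are read one at a time.
    """
    if not code.isdigit():
        raise ValueError(f"expected digits only, got {code!r}")
    # Leading part: two-digit groups.
    head = [code[i:i + 2] for i in range(0, min(2 * pairs, len(code)), 2)]
    # Trailing part: whatever is left, one digit at a time.
    tail = code[2 * pairs:]
    parts = head + (["-".join(tail)] if tail else [])
    return ", ".join(parts)


print("Objednávka " + chunk_id("01304578931"))
```

Shorter groups are usually easier for a caller to write down, so adjust `pairs` to the way your IDs are normally dictated.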

### Numeric date notation

{% tabs %}
{% tab title="CZ" %}
:flag\_cz: Czech language\
\
Dates are **pronounced correctly by default when put down in ISO format:**

> > **A.** ISO DD-MM-YYYY\
> > 01-11-1995\
> > 1-11-1995\
> > `Pronounced as [1. listopadu 1995]`

> > **B.** ISO YYYY-MM-DD\
> > 1995-11-01\
> > 1995-11-1\
> > `Pronounced as [1. listopadu 1995]`

> > **C.** ISO DD.MM.YYYY\
> > datum 01.11.1995\
> > datum 01. 11. 1995\
> > datum 1.11.1995\
> > datum 1. 11. 1995\
> > `Pronounced as [1. listopadu 1995]`

\
A safe and simple way is to **transcribe the date alphanumerically:**

> > datum 1. listopadu 1995\
> > prvního listopadu 1995\
> > `Pronounced as [1. listopadu 1995]`

> > prvního jedenáctý 1995\
> > `Pronounced as [prvního jedenáctý 1995]`

\
These formats **WON'T WORK** and will be pronounced incorrectly, even if tagged with SSML alias reading rules for dates

> > * DD.MM - 1.11. - `[jedna hodina jedenáct minut]`
> > * DD. MM - 1. 11. - `[první jedenáctý]`
> > * DD/MM - 01/11 - `[nula jedna lomítko jedenáct]`
> > * DD/MM/YYYY - 01/11/1995, 1/11/1995 - `[nula jedna lomítko jedenáct lomítko 1995]`
> > * SSML alias reading rules not supported

To be pronounced correctly, dates containing **only date + month** must be written manually

> > prvního listopadu\
> > dne 2. listopadu\
> > 31\. března\
> > 17\. listopad<br>
> > {% endtab %}

{% tab title="SK" %}
:flag\_sk: Slovak language

Dates are **pronounced correctly by default when put down in ISO format**

> > **A.** ISO DD-MM-YYYY\
> > 01-11-1995\
> > 1-11-1995\
> > `Pronounced as [1. novembra 1995]`

> > **B.** ISO YYYY-MM-DD\
> > 1995-11-01\
> > 1995-11-1\
> > `Pronounced as [1. novembra 1995]`

> > **C.** ISO DD.MM.YYYY\
> > dátum 01.11.1995\
> > dátum 01. 11. 1995\
> > dátum 1.11.1995\
> > dátum 1. 11. 1995\
> > `Pronounced as [1. novembra 1995]`

\
A safe and simple way is to **transcribe the date alphanumerically**

> > dátum 1. novembra 1995\
> > prvého novembra 1995\
> > `Pronounced as [1. novembra 1995]`

> > prvého jedenástý 1995\
> > `Pronounced as [prvého jedenástý 1995]`

\
These formats listed below **WON'T WORK** and will be pronounced incorrectly, even if tagged with SSML alias reading rules for dates

> > * DD.MM - 1.11. - `[jedna hodina jedenásť minút]`
> > * DD. MM - 1. 11. - `[prvý jedenásť]`
> > * DD/MM - 01/11 - `[nula jeden lomka jedenásť]`
> > * DD/MM/YYYY - 01/11/1995, 1/11/1995 - `[jeden lomka jedenásť lomka 1995]`
> > * SSML alias reading rules not supported
> >   {% endtab %}

{% tab title="EN" %}
:flag\_gb: English language

Dates are **pronounced correctly by default when put down in ISO format**

> > **A.** ISO DD-MM-YYYY\
> > 01-11-1995\
> > 1-11-1995\
> > `Pronounced as [the first of November 1995]`

> > **B.** ISO YYYY-MM-DD\
> > 1995-11-01\
> > 1995-11-1\
> > `UK and US: Pronounced as [the first of November 1995]`

> > **C.** ISO DD.MM.YYYY\
> > date 01.11.1995\
> > date 1.11.1995\
> > `UK: Pronounced as [the first of November 1995]`\
> > `US: Pronounced as [January eleventh 1995]`

> > **D.** ISO DD/MM/YYYY and YYYY/MM/DD\
> > 01/11/1995\
> > 1995/11/01\
> > `UK: Pronounced as [the first of November 1995]`\
> > `US: Pronounced as [January eleventh 1995]`

> > **E.** Also\
> > Nov,1st 1995\
> > November 1st 1995\
> > `Pronounced as [November first, 1995]`\
> > 1 November 1995\
> > 1 Nov 1995\
> > 1st November 1995\
> > `Pronounced as [the first of November 1995]`

\
These inputs **WON'T BE pronounced correctly**

* 1 st November 1995 - `[one es tee November 1995]`

* date 1. November 1995 - `[one November 1995]`

* November, 1. 1995 - `[November one 1995]`

* 01/11 (EN-UK voices) - `[zero one eleventh]`

* 1.11.1995 - `[one. eleven. 1995]`

* Be mindful of US date format MM/DD

> > - 01/11 (EN-US voices) - `[January the eleventh]`

* Various formats of the **SSML say-as tag** with the date attribute **are supported**

> > - **DMY**: \<say-as interpret-as="date" format="dmy">1/11/1995\</say-as> - `US:[November 1st 1995]` , `UK:[the first of November 1995]`
> > - **MDY**: \<say-as interpret-as="date" format="mdy">1/11/1995\</say-as>- `US:[January 11th 1995]` , `UK:[the 11th of January 1995]`
> > - **MD**: \<say-as interpret-as="date" format="md">1/11\</say-as> - `US:[January 11th]` , `UK:[the 11th of January]`
> > - **DM**: \<say-as interpret-as="date" format="dm">1/11\</say-as>- `US:[November 1st]` , `UK:[the 1st of November]`
> > - **MY**: \<say-as interpret-as="date" format="my">11/1995\</say-as>- `US, UK:[November 1995]`
> > - **YM**: \<say-as interpret-as="date" format="ym">1995/11\</say-as>- `US, UK:[November 1995]`
> >   {% endtab %}
> >   {% endtabs %}
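
Because day + month notations such as `1.11.` are not read as dates in Czech, one option is to write the month out before the text is sent to synthesis. A minimal Python sketch, assuming the date is available as a `datetime.date`; the helper name `czech_date` is hypothetical, and the month names are the standard Czech genitive forms used in the examples above (`listopadu`, `března`).

```python
from datetime import date

# Czech month names in the genitive case, as used after a day number.
CZECH_MONTHS_GENITIVE = [
    "ledna", "února", "března", "dubna", "května", "června",
    "července", "srpna", "září", "října", "listopadu", "prosince",
]


def czech_date(value: date, with_year: bool = True) -> str:
    """Return a written-out Czech date such as '1. listopadu 1995'."""
    month = CZECH_MONTHS_GENITIVE[value.month - 1]
    text = f"{value.day}. {month}"
    if with_year:
        text += f" {value.year}"
    return text


print(czech_date(date(1995, 11, 1)))
print(czech_date(date(2024, 3, 31), with_year=False))
```

The same approach should carry over to Slovak with a different month table, but that has not been checked here.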

### Time notation

{% tabs %}
{% tab title="CZ" %}
:flag\_cz: Czech language&#x20;

Standard **ISO format is supported and is pronounced as time correctly by default**

> > * HH:mm - 15:52 - `[Patnáct hodin padesát dvě minuty]`
> > * HH:mm:ss - 15:52:25 - `[Patnáct hodin padesát dva minut dvacet pět sekund]`
> > * HH.mm - 15.52 - `[Patnáct hodin padesát dva minut]`

\
**Whole hours**, even if written down digitally, **are pronounced as analog time**

> > * HH:00 - 15:00 - `[Patnáct hodin]`
> > * HH:00:00 - 15:00:00 - `[Patnáct hodin]`

\
**The Czech SI unit system IS&#x20;**<mark style="color:red;">**NOT SUPPORTED**</mark>

> > Won't pronounce *hod* or *h* as hodin/hodiny\
> > 15 hod - `[patnáct hod]`\
> > 15 h - `[patnáct há]`\
> > Won't pronounce *min* or *mins* as minut/minuty\
> > 15 min - `[patnáct min]`\
> > 15 hod 15 min - `[patnáct hod patnáct min]`

\
In case you need **time to be pronounced as analog** or Czech SI units to be pronounced correctly, it is necessary **to transcribe your input**

> > * 15 hod → 15 hodin
> > * 15 min → 15 minut
> > * 2 min → 2 minuty
> > * 15:15 → čtvrt na čtyři
> > * 12:00 → poledne
> > * 12:30 → půl jedné odpoledne
> >   {% endtab %}

{% tab title="SK" %}
:flag\_sk: Slovak language

Standard **ISO format is supported and is pronounced as time correctly by default**

> > * HH:mm - 16:10 - `[šestnásť hodín desať minút]`
> > * HH:mm:ss - 16:10:10 - `[šestnásť hodín desať minút desať sekúnd]`
> > * HH.mm - 16.10 - `[šestnásť hodín desať minút]`

\
**Whole hours**, even if written down digitally, **are pronounced as analog time**

> > * HH:00 - 16:00 - `[šestnásť hodín]`
> > * HH:00:00 - 16:00:00 - `[šestnásť hodín]`

\
Slovak SI unit system **IS SUPPORTED** only for *hod* and *min*

> > 16 hod - `[16 hodín]`\
> > 15 min - `[15 minút]`\
> > 15 h 10 min - `[15 hodín 10 minút]`

> > Won't read properly other SI formats such as\
> > 16 h - `[16 h]`\
> > 10 s - `[10 s]`\
> > 10 sek - `[10 sek]`

> > If you need the word *seconds* to be pronounced, write it down manually\
> > 16 hod 10 min 10 sekúnd - `[16 hodín 10 minút 10 sekúnd]`

\
In case you need **time to be pronounced as analog** or Slovak SI units to be pronounced correctly, it is necessary **to transcribe your input**

> > * 15:15 → štvrť na štyri
> > * 12:00 → poludnie
> > * 12:30 → pol jednej
> > * 10 s → 10 sekúnd
> > * po 8:00 → po ôsmej hodine
> >   {% endtab %}

{% tab title="EN" %}
:flag\_gb: English language

In case you need **time to be pronounced as analog** or time notation to be pronounced correctly, it is necessary **to transcribe your input:**

> > * 15:15 → quarter past three
> > * 12:00 → at noon
> > * 12:30 → half past twelve
> > * 10 s → 10 seconds
> > * around 8 → around 8-ish

Written-out clock notations are read as written:

> > **A.** 10 o'clock\
> > `[10 o'clock]`

> > **B.** 10:00 AM\
> > 10 a.m.\
> > 10 AM\
> > `[10 A. M.]`

Also, other formats of time are supported:

> > **A.** HH:mm:ss\
> > 10:50:10\
> > `[10 hours 50 minutes and 10 seconds]`

> > **B.** HH:mm\
> > 15:00\
> > `UK, US: [three P.M.]`\
> > 10:00\
> > `UK, US: [ten o'clock]`\
> > 10:45\
> > `UK, US: [ten forty five]`

:white\_check\_mark: Standard ISO format is supported and pronounced correctly by default.
{% endtab %}
{% endtabs %}
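
Since Czech voices do not expand *hod*, *h*, or *min*, the abbreviations can be rewritten before synthesis. The following Python sketch is an assumption-laden illustration: `expand_czech_time_units` is a hypothetical helper, and it only picks the plural form that follows a bare numeral (1 hodina, 2-4 hodiny, 5+ hodin). Sentences where a preposition changes the grammatical case still need manual wording.

```python
import re

# (form for 1, form for 2-4, form for 0 and 5+)
UNIT_FORMS = {
    "hod": ("hodina", "hodiny", "hodin"),
    "h": ("hodina", "hodiny", "hodin"),
    "min": ("minuta", "minuty", "minut"),
}

# A number followed by one of the abbreviations as a whole word.
UNIT_PATTERN = re.compile(r"\b(\d+)\s*(hod|min|h)\b")


def _plural(count: int, one: str, few: str, many: str) -> str:
    """Pick the Czech form that matches the numeral."""
    if count == 1:
        return one
    if 2 <= count <= 4:
        return few
    return many


def expand_czech_time_units(text: str) -> str:
    """Replace 'hod', 'h' and 'min' after a number with full words."""
    def replace(match: re.Match) -> str:
        count = int(match.group(1))
        return f"{count} {_plural(count, *UNIT_FORMS[match.group(2)])}"

    return UNIT_PATTERN.sub(replace, text)


print(expand_czech_time_units("15 hod 15 min"))
print(expand_czech_time_units("2 min"))
```

Already expanded words such as `15 hodin` are left untouched, because the pattern only matches the abbreviation as a whole word.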

***

## Customizing speech synthesis of variables

A variable is a named storage location that holds a value in computer programming. It is a fundamental concept used to store and manipulate data within a program. In the context of voicebots and speech applications, variables can be used to store and retrieve information that is relevant to the conversation or user interaction.

#### General use

To use variables in the speech output of a voicebot, you need to **incorporate the variable values** within the text that the voicebot will read out loud (in the Message node, fill in the Speech window).&#x20;

General approach:

1. **Define and store the variable values**: In your voicebot's code or script, define and store the necessary variable values based on user input or other relevant data. For example, you might have a variable named *customer\_name* that stores the user's name.
2. **Construct the speech output text**: Craft the message, and include the variable values where appropriate.
3. **Set Speech in Message node**: Paste the constructed speech input so the text-to-speech engine can convert it into audible speech. Your digital assistant will then speak the generated text, incorporating the variable values dynamically.

<figure><img src="https://4261467870-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6iQTvxgRZRwPS1NgIGEb%2Fuploads%2Fw0UkLwKRBCyjalPm2nFF%2Fimage.png?alt=media&#x26;token=bf4642ea-9977-46ef-9878-3c2909ca3e6c" alt=""><figcaption></figcaption></figure>
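
To preview offline what a message will sound like once variables are filled in, the substitution can be imitated with plain string formatting. This is only a sketch: `render_speech` is a hypothetical helper, the curly-bracket placeholder syntax is assumed to match the one used in the *Speech* window, and the platform's own substitution logic is not shown here.

```python
def render_speech(template: str, variables: dict) -> str:
    """Fill {placeholders} in a speech template with variable values.

    Raises KeyError when the template uses a variable that has no value,
    so that a missing value is noticed before synthesis.
    """
    return template.format_map(variables)


speech = render_speech(
    "Hello {customer_name}, your order will arrive on {delivery_day}.",
    {"customer_name": "Eva", "delivery_day": "Monday"},
)
print(speech)
```

The rendered sentence, not the template, is what should be pasted into a text-to-speech preview tool.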

### Common problems and FAQ

<details>

<summary><span data-gb-custom-inline data-tag="emoji" data-code="1f92f">🤯</span> Why does the neural voice in Azure Audio Content Creator tool spell out every variable as <em>left curly bracket [...]  right curly bracket</em>?</summary>

In Azure Speech Studio's Audio Content Creator, modifying the intonation on the pronunciation of variables in text-to-speech (TTS) output can be challenging due to the following:

* **The audio content creator tool is not connected to your database, so variables are not filled with values, therefore your variable names are considered as common words.** Reading out the literal characters occurs when text-to-speech encounters special characters, such as curly brackets, which are often used in programming languages.
* Dynamic nature of variables: Variables can hold a wide range of values, including names, numbers, or user-generated input. Each variable may require unique intonation patterns or pronunciation rules, making it challenging to create a one-size-fits-all approach within the TTS system.
* Lack of context awareness: Text-to-speech engines typically treat variables as plain text and lack the contextual understanding of the variable's meaning. As a result, it becomes difficult to apply nuanced intonation or pronunciation adjustments specifically to variable content.
* Limited SSML support for variable manipulation: While SSML (Speech Synthesis Markup Language) provides a range of tags and attributes to control TTS output, it may have limited support for manipulating variables. SSML tags are primarily designed to modify the speech synthesis process and don't always provide direct control over variable pronunciation or intonation.
* Pre-trained voice limitations: In some cases, if the TTS output is based on pre-recorded voice samples, modifying the intonation or pronunciation of variables may not be feasible as the pre-recorded voice may not have the necessary flexibility to handle variable-specific modifications.

Addressing these challenges often requires a deeper level of customization and integration within the TTS system. It may involve leveraging advanced techniques like custom voice models, data-driven synthesis, or **employing specific programming logic to manipulate variable-specific intonation patterns.**

</details>

<details>

<summary>Why can't I use &#x3C;alias> SSML tags on a part of speech with a variable in it?</summary>

Since variable values are dynamic, there's no point in replacing them with a single alias. The synthesized message would always be the same regardless of the variable's value.

</details>

<details>

<summary>Why is it challenging to customize the synthesis intonation of text with variables?</summary>

Variable values may vary significantly in length, so the rhythm of speech, the number of syllables, the vowel content, and the prosody of a sentence are slightly different in every case.

```

SPEECH input: Jste prosím {gender} {name_surname}?

Possible text-to-speech outputs:
Jste prosím pan Jan Kár?
Jste prosím pan Ivo Krč?
Jste prosím pan Pavel Novák?
Jste prosím paní Eva Černá?
Jste prosím paní Eliška Drahokoupilová?
Jste prosím paní Filémína Strčskrzprstová?
Jste prosím pan Květoslav Podhorodecký?
[...]
```

The only way to handle this situation is to temporarily substitute variables with dummy values as examples.

:bulb: **Try various intonation curves until you find one configuration that is acceptable for all dummy cases.**&#x20;
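
Dummy sentences for such tests can be generated in bulk. A minimal Python sketch, assuming curly-bracket placeholders as in the example above; `preview_variants` and the sample values are hypothetical. Sorting by length puts the shortest and longest sentences at the two ends, which are usually the cases where one intonation setting is most likely to break.

```python
def preview_variants(template: str, samples: list) -> list:
    """Render the template with every sample and sort results by length."""
    rendered = [template.format(**sample) for sample in samples]
    return sorted(rendered, key=len)


variants = preview_variants(
    "Jste prosím {gender} {name_surname}?",
    [
        {"gender": "paní", "name_surname": "Eliška Drahokoupilová"},
        {"gender": "pan", "name_surname": "Jan Kár"},
        {"gender": "pan", "name_surname": "Pavel Novák"},
    ],
)
for sentence in variants:
    print(sentence)
```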

</details>
