In this post, we will walk through how to shape exactly how your applications sound using Microsoft Azure Speech and SSML.
Whether you are building an IVR, a virtual assistant, an e-learning product, or an accessibility feature, the way your app speaks matters. We will start with what the technology does at a high level, then move into practical examples you can apply today.
What Azure Speech And SSML Actually Do
Azure Speech is part of Azure Cognitive Services. It offers text-to-speech (TTS), speech-to-text, translation, and related capabilities via cloud APIs. For TTS specifically, Azure uses deep neural networks trained on recordings from voice actors and large language datasets.
When you send text to the Azure Speech service, the model converts it into a phonetic representation, predicts prosody (timing, emphasis, intonation), and then generates an audio waveform. The result sounds far more natural than the robotic synthesis from a few years ago.
SSML (Speech Synthesis Markup Language) is an XML-based standard that lets you control how that text is spoken. Instead of just sending “plain” text, you wrap parts of your content in SSML tags to guide the engine — things like rate, pitch, pauses, pronunciation, and which voice to use.
Why Customisation Matters
Plain TTS output is often acceptable, but it can be:
- Too fast or too slow for your audience
- Monotone, lacking emphasis on key information
- Incorrect for names, acronyms, or technical jargon
- Inconsistent across different languages or channels
With SSML and Azure Speech, you can tune the experience to:
- Match your brand voice and tone
- Improve comprehension and reduce user effort
- Handle complex content like phone numbers, dates, and URLs
- Localise behaviour to different regions and languages
Think of SSML as the difference between “reading text out loud” and “performing a script”.
Core Concepts You Need To Know
1. Voices And Neural TTS
Azure provides a catalogue of voices in different languages, including neural voices that sound more natural and expressive. Each voice has an identifier like en-AU-NatashaNeural or en-US-GuyNeural.
You can select a voice in SSML using the <voice> element, or via API parameters. Neural voices are generally what you want for modern applications unless cost or compatibility pushes you elsewhere.
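If you are calling the service through the Speech SDK rather than sending raw SSML, voice selection is a single configuration property. Here is a minimal sketch using the Python SDK (the azure-cognitiveservices-speech package; the key and region values are placeholders):
import azure.cognitiveservices.speech as speechsdk

# Placeholders: use your own Speech resource key and region
speech_config = speechsdk.SpeechConfig(subscription="<YOUR_KEY>", region="<YOUR_REGION>")

# Pick a neural voice by its identifier
speech_config.speech_synthesis_voice_name = "en-AU-NatashaNeural"

# With no audio config specified, output plays on the default speaker
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async("Hello from Azure Speech.").get()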
2. SSML Structure
SSML documents are XML. A typical minimal document looks like this:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
Hello from Azure Speech using SSML.
</voice>
</speak>
Key elements you will use often:
- <speak> – root container for SSML
- <voice> – choose a specific voice
- <prosody> – control rate, pitch, and volume
- <break> – insert pauses
- <say-as> – control pronunciation of numbers, dates, etc.
- <sub> – substitute a different spoken form
3. Text-only Vs SSML Requests
When calling Azure Speech, you can send either:
- Plain text with a chosen voice and language
- SSML markup that embeds both the text and all instructions
SSML gives you far more control and should be your default for anything customer-facing or branded.
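To make the contrast concrete, here is a short Python sketch: the same synthesizer accepts either a plain string or a full SSML document, and only the SSML path carries your markup instructions.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<YOUR_KEY>", region="<YOUR_REGION>")
speech_config.speech_synthesis_voice_name = "en-AU-NatashaNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Option 1: plain text; voice and language come from the config
synthesizer.speak_text_async("Your invoice is ready.").get()

# Option 2: SSML; the markup carries the voice and all instructions
ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-AU'>
  <voice name='en-AU-NatashaNeural'>
    Your invoice is <emphasis level='moderate'>ready</emphasis>.
  </voice>
</speak>"""
synthesizer.speak_ssml_async(ssml).get()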
Setting Up Azure Speech On CloudProinc.com.au Projects
From a project perspective, you will usually:
- Create an Azure resource for Speech in the Azure Portal.
- Capture the Key and Region for your resource.
- Use the Speech SDK (or REST) from your chosen platform (.NET, Node.js, Python, Java, etc.).
- Start with plain TTS, then add SSML as you refine the experience.
Below we focus on the SSML itself. The transport example uses REST, so the same markup is portable across tech stacks commonly used in CloudProinc.com.au customer environments.
Basic SSML Example With Azure Speech REST API
The simplest call to Azure Speech with SSML looks like this:
curl -X POST "https://<YOUR_REGION>.tts.speech.microsoft.com/cognitiveservices/v1" \
  -H "Ocp-Apim-Subscription-Key: <YOUR_KEY>" \
  -H "Content-Type: application/ssml+xml" \
  -H "X-Microsoft-OutputFormat: audio-16khz-32kbitrate-mono-mp3" \
  -H "User-Agent: cloudpro-tts-demo" \
  --data "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-AU'><voice name='en-AU-NatashaNeural'>Welcome to CloudPro, your cloud transformation partner.</voice></speak>" \
  --output welcome.mp3
The TTS endpoint expects a User-Agent header identifying your application. The call returns an MP3 audio stream, written here to welcome.mp3, which you can store or stream directly to users.
Controlling Pace, Pitch, And Volume With <prosody>
The <prosody> element lets you adjust:
- rate – how fast the text is spoken
- pitch – perceived tone (higher or lower)
- volume – relative loudness
Values can be preset keywords (slow, fast, etc.) or percentages.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
<prosody rate="-10%">
This sentence is slightly slower for clarity.
</prosody>
<prosody rate="fast" pitch="+2st">
This part is more energetic and upbeat.
</prosody>
</voice>
</speak>
Practical guidance:
- For IVR or support flows, slightly slower helps comprehension.
- For marketing or product tours, a modest speed increase can sound more engaging.
- Avoid extreme pitch shifts; they can sound unnatural even in neural voices.
Adding Natural Pauses With <break>
Humans pause to let information sink in. TTS should do the same.
<break> supports time (e.g. 500ms) or strength (e.g. medium, strong).
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
Welcome to CloudPro.<break time="400ms"/>
We help you modernise infrastructure,<break strength="medium"/>
optimise costs,<break strength="medium"/>
and accelerate delivery.
</voice>
</speak>
Use longer pauses when transitioning between sections or emphasising critical instructions (e.g. one-time passwords, safety information).
Fixing Pronunciation With <say-as> And <sub>
Some content simply reads badly if left to default pronunciation — product names, acronyms, or mixed-format identifiers. SSML helps here in two ways.
Using <say-as> For Structured Data
<say-as> tells the engine what the content is so it can handle it correctly.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
Your appointment is on
<say-as interpret-as="date" format="dmy">25/11/2025</say-as>.
Please call <say-as interpret-as="telephone">1300 555 123</say-as>
if you need to reschedule.
</voice>
</speak>
Common interpret-as values:
- number
- digits (read one digit at a time)
- telephone
- date
- time
- currency
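If you build prompts programmatically, a small helper keeps this markup consistent. In this Python sketch, say_as is a hypothetical convenience function, not part of any SDK:
from xml.sax.saxutils import escape

def say_as(text, interpret_as, fmt=None):
    """Wrap text in an SSML <say-as> element (hypothetical helper)."""
    fmt_attr = f' format="{fmt}"' if fmt else ""
    return f'<say-as interpret-as="{interpret_as}"{fmt_attr}>{escape(text)}</say-as>'

# Read one digit at a time, e.g. for a one-time password
print(say_as("482913", "digits"))
# Australian-style date
print(say_as("25/11/2025", "date", fmt="dmy"))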
Using <sub> For Custom Spoken Forms
<sub> lets you display one text but speak another. This is ideal for product names or abbreviations.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
Welcome to <sub alias="Cloud Pro">CloudProinc.com.au</sub>.
Your CPU utilisation on
<sub alias="Kubernetes">K8s</sub>
cluster one is within normal range.
</voice>
</speak>
Use <sub> anywhere your written form is not how you want it spoken.
Switching Voices And Languages Dynamically
You can mix voices and even languages within a single SSML document. This is handy for bilingual content or when you want different voices for narration and system messages.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
Welcome to your multi-lingual assistant.
</voice>
<voice name="ja-JP-NanamiNeural">
<lang xml:lang="ja-JP">
こちらは日本語の案内です。
</lang>
</voice>
</speak>
From a design standpoint, avoid switching voices too frequently; it can be distracting. Use changes for clear role or language boundaries.
Emphasis And Expressiveness
Neural voices respond well to subtle emphasis. The <emphasis> tag lets you highlight words without rewriting the text.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
<emphasis level="moderate">
Important
</emphasis>
update: your backup did not complete last night.
</voice>
</speak>
Use emphasis sparingly — too much and the speech feels artificial. Focus it on warnings, key actions, or numbers that users must remember.
Practical Patterns For Real Applications
1. IVR And Contact Centre Menus
Goals: clarity, low cognitive load, and quick navigation.
- Slow the overall rate slightly.
- Add brief pauses between options.
- Emphasise key words like “billing”, “support”, or option numbers.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">
<voice name="en-AU-NatashaNeural">
<prosody rate="-5%">
For <emphasis>support</emphasis>, press 1.<break time="300ms"/>
For <emphasis>billing</emphasis>, press 2.<break time="300ms"/>
For all other enquiries, press 3.
</prosody>
</voice>
</speak>
2. E-learning And Training Content
Goals: engagement and retention.
- Use a friendly neural voice.
- Vary prosody slightly between sections to avoid monotony.
- Add pauses before and after key definitions (see the sketch below).
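As a sketch of that templating idea in Python, build_lesson_ssml below is a hypothetical helper that alternates pace between sections and pads each one with a pause; the voice name and timings are assumptions to tune:
# Hypothetical helper: vary delivery between sections so longer
# lessons do not sound flat.
def build_lesson_ssml(sections, voice="en-AU-NatashaNeural"):
    parts = []
    for i, text in enumerate(sections):
        rate = "-5%" if i % 2 == 0 else "+0%"  # alternate pace slightly
        parts.append(f'<prosody rate="{rate}">{text}</prosody><break time="600ms"/>')
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="en-AU"><voice name="{voice}">{"".join(parts)}</voice></speak>'
    )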
3. Operational Alerts And Dashboards
Goals: quick understanding and appropriate urgency.
- Keep messages short.
- Use emphasis and pauses to signal severity and required actions (see the sketch after this list).
- Maintain a consistent style across channels (web, mobile, voice).
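To keep that consistent, severity can be mapped to markup in one place. In this Python sketch, the severity-to-style table is an illustrative assumption, not an Azure feature:
# Hypothetical mapping from alert severity to SSML treatment
SEVERITY_STYLES = {
    "info":     {"rate": "+0%",  "emphasis": "none"},
    "warning":  {"rate": "-5%",  "emphasis": "moderate"},
    "critical": {"rate": "-10%", "emphasis": "strong"},
}

def build_alert_ssml(severity, message, voice="en-AU-NatashaNeural"):
    style = SEVERITY_STYLES.get(severity, SEVERITY_STYLES["info"])
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-AU">'
        f'<voice name="{voice}"><prosody rate="{style["rate"]}">'
        f'<emphasis level="{style["emphasis"]}">{severity.title()}:</emphasis>'
        f'<break time="300ms"/>{message}</prosody></voice></speak>'
    )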
Testing, Tuning, And Governance
For technical leaders and managers, the challenge is less about "can we do this" and more about "how do we maintain a consistent, high-quality voice across products".
Some practical recommendations:
- Centralise SSML templates – store core prompts and patterns in a shared repository rather than scattering them through code.
- Define a voice style guide – rate, tone, and specific voices that represent your brand (similar to design systems for visuals).
- Automated testing – for critical flows, record sample outputs and include audio regression tests where feasible (see the sketch after this list).
- Accessibility reviews – involve users who rely on assistive technologies; adjust speed and clarity based on real feedback.
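For the audio regression idea, one low-effort approach is to pin a voice, render prompts to memory, and hash the bytes against a stored baseline. The sketch below uses the Python Speech SDK; the baseline workflow is an assumption you would adapt:
import hashlib
import azure.cognitiveservices.speech as speechsdk

def synthesize_bytes(ssml, key, region):
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # audio_config=None renders to memory instead of the default speaker
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)
    return synthesizer.speak_ssml_async(ssml).get().audio_data

def test_welcome_prompt_unchanged(ssml, key, region, baseline_sha256):
    # Neural synthesis is not guaranteed to be byte-stable across service
    # updates, so treat a mismatch as "listen and re-baseline", not a hard bug.
    digest = hashlib.sha256(synthesize_bytes(ssml, key, region)).hexdigest()
    assert digest == baseline_sha256, "Audio changed: review and re-baseline"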
When To Go Further With Custom Neural Voices
If standard Azure voices cannot match your brand or requirements, Azure also offers Custom Neural Voice. This lets you train a bespoke voice using recorded samples (with strict consent and compliance rules).
From a CloudProinc.com.au project view, custom voices are most appropriate when:
- Your organisation has a strong audible brand (e.g. media, finance, telco).
- You need consistent character voices across multiple apps and campaigns.
- You are prepared to manage governance, consent, and ongoing review.
Even with custom voices, SSML remains the primary tool for controlling how that voice behaves in different contexts.
Next Steps
To start bringing this into your own solutions:
- Identify 2–3 key voice interactions in your product (IVR menus, onboarding flows, alerts).
- Convert them from plain text to SSML using the patterns above.
- Listen with headphones and speakers; adjust rate, pauses, and emphasis.
- Run a quick user test with real customers or staff and refine.
Azure Speech and SSML give you a flexible, scalable way to move from generic TTS to polished, human-friendly voice experiences. With a modest investment in design and testing, you can significantly improve how your applications sound — and how users feel when they interact with them.