# How It Works
This page describes the architecture and internal flow of a TTS request
through the microservice.
## Architecture
At a high level, the Text To Speech service is a FastAPI application that
accepts a JSON request, runs it through a runtime-backed TTS pipeline
(OpenVINO or PyTorch), and returns either raw WAV audio or a JSON envelope
containing metadata and a base64-encoded WAV payload. Models are loaded and
warmed up once per process and reused across requests.
```mermaid
%%{init: {
'theme': 'base',
'themeVariables': {
'fontFamily': '"IntelOne Display", "Intel Clear", "Inter", "Segoe UI", Arial, sans-serif',
'fontSize': '14px',
'primaryColor': '#0068B5',
'primaryTextColor': '#FFFFFF',
'primaryBorderColor': '#00377C',
'lineColor': '#00377C',
'secondaryColor': '#EEF3F8',
'tertiaryColor': '#F7F8FA',
'background': '#FFFFFF',
'mainBkg': '#FFFFFF',
'clusterBkg': '#F7F8FA',
'clusterBorder': '#0068B5',
'edgeLabelBackground': '#FFFFFF',
'noteBkgColor': '#F7F8FA',
'noteTextColor': '#3A3A3A'
}
}}%%
flowchart LR
Client([Client])
subgraph Service["Text To Speech (FastAPI, :8011)"]
API["API Layer
(speech / voices / health)"]
Pipeline["Pipeline Orchestrator
(pipeline.py)"]
Backend["TTS Backend
(openvino | pytorch)"]
Voices["Voice / Speaker Registry"]
Session[("Session Store
storage/<session_id>/")]
end
Models[("Model Cache
models/")]
Device{{"Inference Device
CPU / GPU / NPU"}}
Client -- "POST /v1/audio/speech
GET /v1/audio/voices" --> API
API --> Pipeline
Pipeline --> Voices
Pipeline --> Backend
Backend --> Device
Backend -. loads / warms up .-> Models
Pipeline <-- "optional persist" --> Session
Pipeline -- "audio/wav or JSON + base64 WAV
X-Session-ID header" --> Client
classDef client fill:#FFFFFF,stroke:#0068B5,stroke-width:2px,color:#3A3A3A;
classDef core fill:#0068B5,stroke:#00377C,stroke-width:1.5px,color:#FFFFFF;
classDef backend fill:#00A3F4,stroke:#00377C,stroke-width:1.5px,color:#FFFFFF;
classDef store fill:#6C6C6C,stroke:#0068B5,stroke-width:1.5px,color:#FFFFFF;
classDef device fill:#00C7FD,stroke:#00377C,stroke-width:1.5px,color:#3A3A3A;
class Client client;
class API,Pipeline,Voices core;
class Backend backend;
class Session,Models store;
class Device device;
style Service fill:#F7F8FA,stroke:#0068B5,stroke-width:1.5px,color:#3A3A3A;
```
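The overview states that models are loaded and warmed up once per process and that the backend is chosen by config. A minimal sketch of that load-once pattern follows, assuming a simple in-process cache guarded by a lock. `EchoBackend`, `BACKENDS`, and `get_backend` are hypothetical names used only for illustration; the service's actual classes in `components/` may look quite different.

```python
import threading
from typing import Callable, Dict, Tuple


class EchoBackend:
    """Hypothetical stand-in; a real backend would wrap OpenVINO or PyTorch."""

    def __init__(self, runtime: str, device: str) -> None:
        self.runtime = runtime
        self.device = device
        self.warm = False

    def warmup(self) -> None:
        # A real backend would likely run one throwaway synthesis here.
        self.warm = True


# Registry keyed by the runtime names the document mentions.
BACKENDS: Dict[str, Callable[[str], EchoBackend]] = {
    "openvino": lambda device: EchoBackend("openvino", device),
    "pytorch": lambda device: EchoBackend("pytorch", device),
}

_lock = threading.Lock()
_cache: Dict[Tuple[str, str], EchoBackend] = {}


def get_backend(runtime: str, device: str = "CPU") -> EchoBackend:
    """Return a warmed-up backend, creating it only on first use."""
    key = (runtime, device)
    with _lock:
        if key not in _cache:
            if runtime not in BACKENDS:
                raise ValueError(f"unsupported runtime: {runtime!r}")
            backend = BACKENDS[runtime](device)
            backend.warmup()
            _cache[key] = backend
        return _cache[key]
```

Calling `get_backend("openvino", "CPU")` twice returns the same object, so only the first request pays the load and warmup cost.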
**Key planes:**
- **API layer** — request validation, language/voice resolution, and
response shaping (raw `audio/wav` vs. JSON envelope).
- **Pipeline orchestrator** — owns model load/warmup, speaker resolution,
synthesis, and optional persistence.
- **TTS backend** — pluggable OpenVINO or PyTorch runtime selected via
config; handles model placement on the configured device and precision.
- **Voice registry** — exposes the available speakers/voices for the
active model and resolves the request's `voice` field.
- **Session store** — when `pipeline.persist_outputs` is true, the
synthesized WAV and metadata are written under `storage/<session_id>/`.
## Request Flow
1. **Request** — A client sends a JSON body to `POST /v1/audio/speech` with
the text to synthesize and optional `voice`, `language`,
`instructions`, and `response_format` fields.
2. **Validation** — The service validates the request, enforces the English
language constraint, and resolves the speaker against the configured
voices.
3. **Model load / warmup** — On first use, the configured TTS model is
loaded according to `models.tts.runtime` (`openvino` or `pytorch`) on the
configured `device` (`CPU`, `GPU`, or `NPU`) and `dtype`. Subsequent
requests reuse the warmed-up pipeline.
4. **Synthesis** — The pipeline generates a WAV waveform from the input
text using the chosen model and speaker embedding.
5. **Response** — When `response_format=wav`, the service returns raw
`audio/wav` with `X-Session-ID` in the response header. When
`response_format=json`, it returns metadata plus a base64-encoded WAV
payload.
6. **Persistence (optional)** — If `pipeline.persist_outputs` is true, the
WAV and metadata are also written to `storage/<session_id>/`.
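The two response shapes in step 5 can be sketched as follows. This is an illustration under stated assumptions, not the service's code: the JSON key names `session_id` and `audio_base64` are hypothetical, and whether the JSON response also carries the `X-Session-ID` header is not specified here, so the sketch leaves it out.

```python
import base64
import io
import json
import wave


def make_wav(pcm16: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw mono 16-bit PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)
    return buf.getvalue()


def shape_response(wav_bytes: bytes, session_id: str, response_format: str):
    """Return (media_type, headers, body) for the two documented formats."""
    if response_format == "wav":
        return "audio/wav", {"X-Session-ID": session_id}, wav_bytes
    if response_format == "json":
        envelope = {
            # Hypothetical key names; the real envelope may differ.
            "session_id": session_id,
            "audio_base64": base64.b64encode(wav_bytes).decode("ascii"),
        }
        return "application/json", {}, json.dumps(envelope).encode("utf-8")
    raise ValueError(f"unsupported response_format: {response_format!r}")


def decode_envelope(body: bytes) -> bytes:
    """Client side: recover the WAV bytes from the JSON envelope."""
    return base64.b64decode(json.loads(body)["audio_base64"])
```

A client that asks for `json` gets the audio back by base64-decoding the payload, which yields the same bytes the `wav` format would have returned directly.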
## Components
- `api/` — FastAPI routers for speech generation, voice metadata, and
health.
- `pipeline.py` — Orchestrates model loading, warmup, and synthesis.
- `components/` — Backend implementations for the OpenVINO and PyTorch TTS
runtimes.
- `utils/` — Audio utilities, config loading, and session helpers.
- `dto/` — Request and response data models.
## Configuration Surface
All runtime behavior is driven by `config.yaml`, shared by both standalone
and container runs, with targeted overrides via `TEXT_TO_SPEECH__...`
environment variables. See the [Configuration Guide](./get-started/configuration.md) for the
full list of fields.
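As a rough illustration of how prefixed environment variables can override a nested config, the sketch below assumes that the text after the `TEXT_TO_SPEECH__` prefix is split on double underscores and lowercased to form a key path. That mapping is an assumption based on a common convention, and `apply_env_overrides` is a hypothetical helper; consult the Configuration Guide for the actual variable names and rules.

```python
import copy
from typing import Any, Dict, Mapping

PREFIX = "TEXT_TO_SPEECH__"


def apply_env_overrides(config: Dict[str, Any],
                        environ: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of config with PREFIX-ed variables merged in."""
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        # Assumed convention: "A__B__C" maps to config["a"]["b"]["c"].
        path = name[len(PREFIX):].lower().split("__")
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value  # values stay strings; no type coercion here
    return merged
```

In practice the second argument would be `os.environ`; passing a plain dict keeps the function easy to test. The original config is left untouched because the merge works on a deep copy.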