Text To Speech
The Text To Speech microservice turns text into natural-sounding speech. It is designed to be dropped into voice-enabled applications (kiosks, assistants, IVR, accessibility tooling) where a simple HTTP request should return either raw WAV audio or a JSON payload with metadata.
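As a rough illustration of handling the two response shapes, the sketch below inspects a response body together with its content type. The `audio/wav` and `application/json` content types and the helper names `describe_speech_response` and `make_silent_wav` are assumptions made for this example, not part of the service; check what the service actually returns before relying on them.

```python
import io
import json
import wave


def describe_speech_response(content_type: str, body: bytes) -> dict:
    """Summarize a response body as WAV audio or JSON metadata.

    Assumes raw audio is labeled audio/wav (or a common alias) and
    metadata is labeled application/json; both are illustrative guesses.
    """
    kind = content_type.split(";")[0].strip().lower()
    if kind in ("audio/wav", "audio/x-wav", "audio/wave"):
        # The standard-library wave module parses PCM WAV headers.
        with wave.open(io.BytesIO(body), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "kind": "wav",
                "channels": wav.getnchannels(),
                "sample_rate": rate,
                "duration_s": frames / rate if rate else 0.0,
            }
    if kind == "application/json":
        return {"kind": "json", "payload": json.loads(body.decode("utf-8"))}
    raise ValueError(f"unexpected content type: {content_type!r}")


def make_silent_wav(seconds: float = 0.5, rate: int = 16000) -> bytes:
    """Build a short silent mono 16-bit WAV, used here as stand-in audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

Branching on the content type rather than on the request parameters keeps the client tolerant of either response shape.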
Use Cases
Voice responses for conversational assistants and kiosks.
Accessibility readers and announcement systems.
IVR / call-flow prompts generated on the edge.
Audio generation pipelines that need a self-hosted, OpenAI-compatible /v1/audio/speech endpoint.
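A minimal sketch of building a request for the `/v1/audio/speech` endpoint follows. Because the endpoint is described as OpenAI-compatible, the body fields `model`, `input`, `voice`, and `response_format` are borrowed from the OpenAI speech API and are assumed, not confirmed, for this service. The base URL, port, and the voice name `default` are placeholders.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    text: str,
    voice: str,
    model: str = "microsoft/speecht5_tts",
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint."""
    # Field names follow the OpenAI speech API; this service may differ.
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port, and voice name; replace with your deployment's.
request = build_speech_request("http://localhost:8000", "Hello there", voice="default")

# Sending is left commented out so the sketch runs without a live service:
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```

Separating request construction from sending makes the payload easy to inspect or log before any audio is generated.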
Key Capabilities
OpenAI-style speech endpoint and a voices/metadata endpoint.
Multi-runtime backends: OpenVINO (Intel-optimized) and PyTorch.
Configurable device (CPU, GPU, NPU) and precision (int8, int4, fp16, fp32) where the runtime/model supports it.
Selectable speaker / voice per model family.
Optional persistence of synthesized output for session reuse.
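The device and precision options above could be checked before starting the service with something like the sketch below. The `InferenceTarget` class and its defaults (`CPU`, `fp32`) are hypothetical; only the lists of device and precision names come from this page. The check covers names only and does not know which combinations a given runtime or model actually supports.

```python
from dataclasses import dataclass

# Values listed on this page; support still depends on runtime and model.
SUPPORTED_DEVICES = ("CPU", "GPU", "NPU")
SUPPORTED_PRECISIONS = ("int8", "int4", "fp16", "fp32")


@dataclass(frozen=True)
class InferenceTarget:
    """Hypothetical holder for a device/precision pair."""

    device: str = "CPU"      # assumed default, not documented here
    precision: str = "fp32"  # assumed default, not documented here

    def __post_init__(self) -> None:
        if self.device not in SUPPORTED_DEVICES:
            raise ValueError(f"unknown device: {self.device!r}")
        if self.precision not in SUPPORTED_PRECISIONS:
            raise ValueError(f"unknown precision: {self.precision!r}")


target = InferenceTarget(device="GPU", precision="fp16")
```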
Supported Models
SpeechT5: microsoft/speecht5_tts (default). Lightweight, English-only, well suited for CPU and edge devices.
Qwen3-TTS: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice with model_variant: custom_voice or voice_design for richer voice control.
Runtimes: openvino (recommended on Intel hardware) and pytorch.
English-only synthesis in the current service build.
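The model, variant, and runtime choices listed above can be captured in a small lookup, sketched here under stated assumptions. The `resolve_model` helper and the `model` and `runtime` keys are hypothetical; `model_variant` and the model and runtime identifiers come from this page. Treating `model_variant` as applicable only to Qwen3-TTS is an assumption of this sketch.

```python
from typing import Optional

DEFAULT_MODEL = "microsoft/speecht5_tts"
RUNTIMES = ("openvino", "pytorch")

# Allowed model_variant values per model; an empty tuple means "none known".
MODEL_VARIANTS = {
    "microsoft/speecht5_tts": (),
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice": ("custom_voice", "voice_design"),
}


def resolve_model(
    model: str = DEFAULT_MODEL,
    runtime: str = "openvino",
    model_variant: Optional[str] = None,
) -> dict:
    """Validate a model selection and return it as a plain dict."""
    if model not in MODEL_VARIANTS:
        raise ValueError(f"unsupported model: {model!r}")
    if runtime not in RUNTIMES:
        raise ValueError(f"unsupported runtime: {runtime!r}")
    if model_variant is not None and model_variant not in MODEL_VARIANTS[model]:
        raise ValueError(f"variant {model_variant!r} not valid for {model!r}")
    selection = {"model": model, "runtime": runtime}
    if model_variant is not None:
        selection["model_variant"] = model_variant
    return selection


qwen = resolve_model(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    runtime="pytorch",
    model_variant="voice_design",
)
```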
Next Steps
Get Started - a step-by-step guide to your first run.
Configuration - how to select models, devices, and precision.
How It Works - learn about the internal request flow.