How It Works
This page describes the architecture and the internal flow of a single voice request through Smart Kiosk Assistant.
Architecture
Smart Kiosk Assistant runs as five cooperating services on a single host.
The browser captures microphone audio and uploads it to kiosk-core, which
orchestrates speech-to-text, retrieval-augmented answer generation, and
speech synthesis through three model-hosting microservices.

Components
- kiosk-ui: Gradio interface. Captures microphone audio via the Web Audio API and posts it to kiosk-core. Polls the session endpoint until answer text and generated audio are available, then plays the audio clips back in order.
- kiosk-core: FastAPI session orchestrator. Owns the per-session state machine, forwards audio to audio-analyzer, sends the transcription to rag-service, and streams the generated answer sentence-by-sentence to text-to-speech.
- audio-analyzer: OpenAI-compatible speech-to-text microservice built on Whisper and OpenVINO.
- rag-service: Local retrieval-augmented generation microservice hosting a Qwen LLM, a BGE embedding model, and a BGE reranker, all on OpenVINO.
- text-to-speech: OpenVINO TTS microservice supporting SpeechT5 and Qwen-TTS.
kiosk-core and kiosk-ui host no models. All inference happens
inside the three model-hosting services.
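The polling behavior described for kiosk-ui can be sketched roughly as follows. This is an illustrative client, not the shipped UI code: the `GET /api/v1/sessions/{session_id}` route and the `tts_audio_segments` field come from this page, while the base URL, the `status` field with its `"completed"` value, the polling interval, and the helper names `http_fetcher` and `poll_segments` are assumptions made for the example.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Iterator

Snapshot = Dict[str, Any]


def http_fetcher(base_url: str) -> Callable[[str], Snapshot]:
    """Build a snapshot fetcher for a kiosk-core base URL (the URL is an assumption)."""

    def fetch(session_id: str) -> Snapshot:
        url = f"{base_url}/api/v1/sessions/{session_id}"
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    return fetch


def poll_segments(
    fetch: Callable[[str], Snapshot],
    session_id: str,
    interval_s: float = 0.5,   # assumed polling interval
    max_polls: int = 240,      # assumed upper bound so the loop always ends
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Any]:
    """Yield each new audio segment exactly once, in the order it was recorded."""
    played = 0
    for _ in range(max_polls):
        snapshot = fetch(session_id)
        segments = snapshot.get("tts_audio_segments", [])
        # Only segments past the ones already yielded are new.
        for segment in segments[played:]:
            yield segment
        played = max(played, len(segments))
        # "status" == "completed" is a hypothetical completion marker.
        if snapshot.get("status") == "completed":
            return
        sleep(interval_s)
```

A caller would iterate `poll_segments(http_fetcher(base_url), session_id)` and hand each segment to the audio player. Passing `fetch` and `sleep` as parameters keeps the loop testable without a running kiosk-core.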
Request Flow
1. Capture: The browser records a microphone utterance and uploads it to kiosk-core as a WAV file along with session parameters.
2. Session start: kiosk-core creates a session, returns the session_id immediately, and runs the rest of the pipeline in the background. The UI polls GET /api/v1/sessions/{session_id} to track progress.
3. Speech-to-text: kiosk-core chunks the upload at silence boundaries and forwards each chunk to audio-analyzer. The combined transcript is appended to the session snapshot.
4. Retrieval-augmented answer: When the user has finished speaking (silence timeout or max-duration reached), kiosk-core sends the transcript and recent conversation history to rag-service. rag-service:
   - embeds the question with the BGE embedding model,
   - retrieves candidate chunks from Chroma,
   - optionally reranks them with the BGE cross-encoder,
   - prompts the Qwen LLM with the retrieved context, and
   - streams the answer back token-by-token.
5. Speech synthesis: As kiosk-core receives the answer stream, it splits the text into sentences and posts each sentence to text-to-speech. The generated WAV files are written to the shared generated_audio/ volume and recorded in the session snapshot.
6. Playback: The browser UI sees new tts_audio_segments in the session snapshot, downloads them from kiosk-core, and plays them sequentially.
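To illustrate the speech synthesis step, here is a minimal sketch of how a token stream might be cut into sentences as it arrives, so that each sentence can be sent for synthesis before the full answer is complete. The boundary rule (`.`, `!`, or `?` followed by whitespace) and the function name `split_sentences` are assumptions for this example; the actual splitting logic in kiosk-core may differ.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ends at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be an unfinished sentence; keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would then be posted to text-to-speech. A rule this simple also splits after abbreviations such as "Dr. Smith", so a production splitter would likely need extra handling for those cases.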
See the Configuration guide for environment variables, model selection, and per-service device fields.