# How It Works This page describes the architecture and the internal flow of a single voice request through Smart Kiosk Assistant. ## Architecture Smart Kiosk Assistant runs as five cooperating services on a single host. The browser captures microphone audio and uploads it to `kiosk-core`, which orchestrates speech-to-text, retrieval-augmented answer generation, and speech synthesis through three model-hosting microservices. ![Smart Kiosk Assistant architecture](./_assets/architecture.png "smart kiosk assistant architecture") ## Components - `kiosk-ui` — Gradio interface. Captures microphone audio via the Web Audio API and posts it to `kiosk-core`. Polls the session endpoint until answer text and generated audio are available, then plays the audio clips back in order. - `kiosk-core` — FastAPI session orchestrator. Owns the per-session state machine, forwards audio to `audio-analyzer`, sends the transcription to `rag-service`, and streams the generated answer sentence-by-sentence to `text-to-speech`. - `audio-analyzer` — OpenAI-compatible speech-to-text microservice built on Whisper and OpenVINO. - `rag-service` — Local retrieval-augmented generation microservice hosting a Qwen LLM, a BGE embedding model, and a BGE reranker, all on OpenVINO. - `text-to-speech` — OpenVINO TTS microservice supporting SpeechT5 and Qwen-TTS. `kiosk-core` and `kiosk-ui` host no models. All inference happens inside the three model-hosting services. ## Request Flow 1. **Capture** — The browser records a microphone utterance and uploads it to `kiosk-core` as a WAV file along with session parameters. 2. **Session start** — `kiosk-core` creates a session, returns the `session_id` immediately, and runs the rest of the pipeline in the background. The UI polls `GET /api/v1/sessions/{session_id}` to track progress. 3. **Speech-to-text** — `kiosk-core` chunks the upload at silence boundaries and forwards each chunk to `audio-analyzer`. The combined transcript is appended to the session snapshot. 4. **Retrieval-augmented answer** — When the user has finished speaking (silence timeout or max-duration reached), `kiosk-core` sends the transcript and recent conversation history to `rag-service`. `rag-service`: - embeds the question with the BGE embedding model, - retrieves candidate chunks from Chroma, - optionally reranks them with the BGE cross-encoder, - prompts the Qwen LLM with the retrieved context, and - streams the answer back token-by-token. 5. **Speech synthesis** — As `kiosk-core` receives the answer stream, it splits the text into sentences and posts each sentence to `text-to-speech`. The generated WAV files are written to the shared `generated_audio/` volume and recorded in the session snapshot. 6. **Playback** — The browser UI sees new `tts_audio_segments` in the session snapshot, downloads them from `kiosk-core`, and plays them sequentially. See the [Configuration](./get-started/configuration.md) guide for environment variables, model selection, and per-service device fields.