API Reference#

Base URL (default): http://127.0.0.1:8012

Health Check#

GET /health

Response

{"status": "ok"}

List Input Devices#

GET /api/v1/devices

Returns audio input devices available on the server host. This is only relevant when calling the server-side microphone capture endpoint.

Response

{
  "devices": [
    {"index": 0, "name": "default", "channels": 2, "sample_rate": 44100},
    {"index": 1, "name": "HDA Intel PCH: ALC256 Analog", "channels": 2, "sample_rate": 44100}
  ]
}

List Sessions#

GET /api/v1/sessions

Response

{
  "sessions": [
    {
      "session_id": "3f1e4d2a-...",
      "status": "completed",
      "created_at": "2026-05-08T10:00:00.000Z",
      ...
    }
  ]
}

Get Session#

GET /api/v1/sessions/{session_id}

Response — session snapshot object (see Session Snapshot).


Start Microphone Session#

POST /api/v1/sessions/start
Content-Type: application/json

Begins microphone capture on the server. Returns immediately; the session runs in the background.

Request Body

Field

Type

Default

Description

device

int | string | null

system default

PortAudio input device index or name

sample_rate

int

16000

Capture sample rate in Hz (800048000)

chunk_seconds

float

4.0

Audio chunk length sent to audio-analyzer (0.530)

silence_timeout_seconds

float

1.5

Silence after speech that ends the session (0.210)

max_session_seconds

float

20.0

Hard cap on session duration (1300)

silence_threshold

int

900

RMS threshold below which audio is silence (132767)

language

string | null

null

Language code hint for ASR (e.g. "en")

temperature

float

0.0

ASR decoding temperature (0.01.0)

analyzer_url

string

env default

audio-analyzer transcription endpoint

rag_url

string

env default

RAG query endpoint

tts_url

string

env default

TTS speech endpoint

tts_model

string

"qwen-tts"

Model name for TTS

tts_voice

string | null

env default

Voice name for TTS

tts_language

string | null

"English"

Language hint for TTS

tts_instructions

string | null

env default

Style instructions for TTS

Example

curl --noproxy '*' -X POST http://127.0.0.1:8012/api/v1/sessions/start \
  -H 'Content-Type: application/json' \
  -d '{
        "language": "en",
        "chunk_seconds": 4,
        "silence_timeout_seconds": 1.5,
        "max_session_seconds": 20,
        "silence_threshold": 900
      }'

Response — initial session snapshot with "status": "running".


Start File Session#

POST /api/v1/sessions/start-file
Content-Type: multipart/form-data

Feeds an uploaded audio file through the same chunking, ASR, RAG, and TTS pipeline as a session started through /api/v1/sessions/start. Useful for testing without capture hardware.

Form Fields

Accepts the same fields as Start Microphone Session plus:

Field

Type

Default

Description

file

binary

required

Audio file to process (WAV recommended)

realtime_factor

float

1.0

Playback speed multiplier for simulated real-time pacing (0.0100.0). Set to a large value (e.g. 10.0) to process as fast as possible.

Example

curl --noproxy '*' -X POST http://127.0.0.1:8012/api/v1/sessions/start-file \
  -F "file=@/path/to/question.wav" \
  -F "sample_rate=16000" \
  -F "chunk_seconds=4" \
  -F "silence_timeout_seconds=1.5" \
  -F "max_session_seconds=20" \
  -F "silence_threshold=900" \
  -F "language=en" \
  -F "temperature=0.0" \
  -F "realtime_factor=10.0"

Response — initial session snapshot with "status": "running".


Stop Session#

POST /api/v1/sessions/{session_id}/stop

Requests an early stop of a running session.

Example

curl --noproxy '*' -X POST http://127.0.0.1:8012/api/v1/sessions/<session_id>/stop

Response

{
  "session_id": "3f1e4d2a-...",
  "status": "stopping",
  "stop_requested_at": "2026-05-08T10:00:05.000Z"
}

Session Snapshot#

The session snapshot returned by Get Session, Start, and Start File has the following structure:

Field

Type

Description

session_id

string

Unique session identifier

source_kind

string

"microphone" or "file"

status

string

"created" / "running" / "stopping" / "completed" / "failed"

created_at

string

ISO 8601 timestamp

started_at

string | null

ISO 8601 timestamp

completed_at

string | null

ISO 8601 timestamp

stop_requested_at

string | null

ISO 8601 timestamp

end_reason

string | null

Why the session ended ("silence_timeout", "max_duration", "stopped_by_api", "completed", "error")

error

string | null

Error message if status is "failed"

speech_started

bool

Whether speech was detected

captured_audio_seconds

float

Total seconds of audio captured

transcript

string

Combined transcript of all processed chunks

partial_transcript

string

Same as transcript; updated while running

response

string

Streamed RAG answer text

tts_audio_segments

array

TTS output clips; see below

tts_errors

array

TTS error strings, if any

tts_audio_segments item#

Field

Type

Description

index

int

1-based sentence index within the session

text

string

Sentence text that was synthesized

audio_file

string

Absolute path to the generated WAV file on the server. Accessible to the Gradio UI when both share the same generated_audio/ volume mount.


Polling Pattern#

Start a session, then poll until status is "completed" or "failed":

# Start
SESSION=$(curl -s --noproxy '*' -X POST http://127.0.0.1:8012/api/v1/sessions/start-file \
  -F "file=@question.wav" -F "realtime_factor=10.0" | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# Poll
while true; do
  STATUS=$(curl -s --noproxy '*' http://127.0.0.1:8012/api/v1/sessions/$SESSION | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['status'])")
  echo "Status: $STATUS"
  [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]] && break
  sleep 1
done

# Read result
curl -s --noproxy '*' http://127.0.0.1:8012/api/v1/sessions/$SESSION | python3 -m json.tool