ADR 0004: Streaming + cancellation for cloned TTS (REPL responsiveness)

Date: 2026-01-23
Status: Accepted

Cloned TTS is computationally heavier than Piper. Even after eliminating the per-utterance model reload, synthesizing a long answer can still take multiple seconds before any audio is ready.

For UX, the key metrics are often time-to-first-audio and interruptibility. We address both as follows:

Generate audio in smaller text batches, chunked at sentence boundaries (or word boundaries where a sentence is too long).
As soon as a batch completes, enqueue it to NonBlockingAudioPlayer.
This provides “speak early” behavior without requiring the model to be truly streaming.
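
The batching approach above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `split_into_batches`, `speak_batched`, the `max_chars` limit, and the `synthesize` / `enqueue` callables are all hypothetical names. In particular, `enqueue` stands in for whatever method NonBlockingAudioPlayer actually exposes for queuing audio, which this ADR does not specify.

```python
import re
from typing import Callable, Iterator


def split_into_batches(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield one batch per sentence; split over-long sentences at word boundaries.

    max_chars is an assumed tuning knob, not a value taken from the ADR.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        current = ""
        for word in sentence.split():
            # Start a new batch when adding this word would exceed the limit.
            if current and len(current) + 1 + len(word) > max_chars:
                yield current
                current = word
            else:
                current = f"{current} {word}" if current else word
        if current:
            yield current


def speak_batched(
    text: str,
    synthesize: Callable[[str], bytes],  # hypothetical: text batch -> audio
    enqueue: Callable[[bytes], None],    # hypothetical: stand-in for the player queue
    max_chars: int = 200,
) -> int:
    """Synthesize batch by batch, enqueuing each result as soon as it is ready."""
    count = 0
    for batch in split_into_batches(text, max_chars):
        enqueue(synthesize(batch))  # playback of batch 1 can start while batch 2 renders
        count += 1
    return count
```

Because each batch is handed to the player as soon as it completes, time-to-first-audio is bounded by the synthesis time of the first batch rather than of the whole answer.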

Note: model sampling is not always preemptible mid-step; we implement best-effort cancellation between batches/chunks.
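
One way to picture best-effort cancellation between batches is a shared flag that is checked before and after each synthesis call. The sketch below assumes a `threading.Event` as the cancel signal; `speak_with_cancellation`, `synthesize`, and `enqueue` are hypothetical names and do not reflect the project's actual callbacks.

```python
import threading
from typing import Callable, Iterable


def speak_with_cancellation(
    batches: Iterable[str],
    synthesize: Callable[[str], bytes],  # hypothetical: blocking, not preemptible
    enqueue: Callable[[bytes], None],    # hypothetical: stand-in for the player queue
    cancel: threading.Event,
) -> int:
    """Enqueue audio batch by batch, stopping at the next batch boundary after cancel.

    Returns the number of batches that were actually enqueued.
    """
    enqueued = 0
    for batch in batches:
        if cancel.is_set():
            break  # cancelled before starting this batch
        audio = synthesize(batch)  # cannot be interrupted mid-call in this sketch
        if cancel.is_set():
            break  # cancelled during synthesis: drop the stale audio
        enqueue(audio)
        enqueued += 1
    return enqueued
```

In this sketch the worst-case cancellation delay is the synthesis time of a single batch, which is one reason to keep batches small. Stopping audio that has already been enqueued is a separate concern for the player and is not shown here.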

Better perceived latency: users hear speech sooner.
Cleaner REPL behavior: new input can interrupt current speech.
Implementation stays within existing architecture (reuse audio player and callbacks).

Related ADR: docs/adr/0002_barge_in_interruption.md (Phase 2 AEC enables true barge-in)
Backlog: docs/backlog/planned/021_streaming_and_cancellation_for_cloned_tts.md