Skip to content

ADR 0004 - Cloned TTS Streaming

ADR 0004: Streaming + cancellation for cloned TTS (REPL responsiveness)

Date: 2026-01-23
Status: Accepted


Context

Cloned TTS is computationally heavier than Piper. Even after eliminating per-utterance model reload, long answers can still take multiple seconds.

For UX, the key metric is often time-to-first-audio and interruptibility:

  • start speaking quickly (even if the full answer takes longer)
  • allow immediate interruption on new user input (barge-in by typing / speaking)

Decision

1) Stream audio chunks into the existing non-blocking player

  • Generate audio in smaller text batches (sentence/word-bounded chunking).
  • As soon as a batch completes, enqueue it to NonBlockingAudioPlayer.
  • This provides “speak early” behavior without requiring the model to be truly streaming.

2) Add cancellation/preemption

  • Track a per-speak “cancel token” for cloned synthesis.
  • stop_speaking() and new user input set the token, stopping:
  • queued playback immediately
  • further batch enqueueing as soon as the generator checks the token

Note: model sampling is not always preemptible mid-step; we implement best-effort cancellation between batches/chunks.


Consequences

  • Better perceived latency: users hear speech sooner.
  • Cleaner REPL behavior: new input can interrupt current speech.
  • Implementation stays within existing architecture (reuse audio player and callbacks).

  • docs/adr/0002_barge_in_interruption.md (Phase 2 AEC enables true barge-in)
  • Backlog: docs/backlog/planned/021_streaming_and_cancellation_for_cloned_tts.md