ADR 0002: Barge-in (Interrupt TTS When User Starts Speaking)
Date: 2026-01-23
Status: Accepted (phased)
Amendment (2026-02-04): Phase 1 shipped with an explicit stop mode (stop-phrase barge-in without AEC) and a stricter wait mode (pause mic processing during TTS). The sections below reflect current behavior and implementation.
Context / Problem
We want a natural conversation: while the assistant is speaking, the user can start speaking and the assistant stops immediately (“barge-in”).
The hard part is that the microphone hears:
- user speech (desired)
- the assistant audio coming from the speakers (echo/feedback)
Without echo cancellation, VAD/STT will often detect the assistant’s own voice and self-interrupt.
Investigation (Jan 2026)
There are open-source acoustic echo cancellation (AEC) options (WebRTC APM, SpeexDSP), and Python bindings exist, but:
- Many require compilation toolchains, have limited platform wheels, or introduce install friction.
- Some are Linux/macOS only in practice (wheel availability varies).
This conflicts with our core constraint: easy pip install on Windows/macOS/Linux.
Decision
Phase 1 (default, out-of-box)
We support “interactive discussion” with voice modes that are robust without AEC, plus a stop phrase:
- wait mode (turn-taking): pause mic processing during TTS playback. This is the most robust way to avoid self-interruption on speakers, but it means there is no voice barge-in (and no stop-phrase detection) while audio is playing.
- stop mode (recommended on speakers): keep mic processing running, but during TTS:
  - suppress normal transcriptions
  - disable “interrupt on any speech”
  - keep a rolling stop-phrase detector active so the user can still say “ok stop” / “okay stop” to cut playback
- full mode (barge-in by speech): keep listening and allow barge-in by detected speech. Best with AEC or a headset; on speakers, echo can still cause false barge-in (mitigated by heuristics).
- Stop phrase interruption (available while listening): saying “ok stop” / “okay stop” stops TTS playback without requiring echo cancellation. A conservative bare “stop” is supported but requires confirmation to reduce false hits.
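The per-mode behavior during TTS playback can be read as a small policy table. The sketch below is illustrative only: `VoiceMode`, `TtsPolicy`, and `policy_during_tts` are hypothetical names, not the project's actual API, and the assumption that full mode keeps forwarding normal transcriptions is ours (the text only says it "keeps listening").

```python
from dataclasses import dataclass
from enum import Enum


class VoiceMode(Enum):
    WAIT = "wait"   # turn-taking: mic processing paused during TTS
    STOP = "stop"   # only the stop phrase can cut playback
    FULL = "full"   # any detected speech can cut playback


@dataclass(frozen=True)
class TtsPolicy:
    """What the mic pipeline is allowed to do while TTS audio is playing."""
    process_mic: bool          # run VAD/STT at all
    emit_transcriptions: bool  # forward normal transcriptions downstream
    interrupt_on_speech: bool  # any detected speech stops playback
    detect_stop_phrase: bool   # rolling "ok stop" detector stays active


def policy_during_tts(mode: VoiceMode) -> TtsPolicy:
    """Map a voice mode to its behavior during playback (hypothetical helper)."""
    if mode is VoiceMode.WAIT:
        # Mic processing is paused, so nothing (not even the stop phrase) is heard.
        return TtsPolicy(False, False, False, False)
    if mode is VoiceMode.STOP:
        # Mic stays on, but only the stop phrase may interrupt.
        return TtsPolicy(True, False, False, True)
    # FULL: assumed to keep transcribing; echo can cause false barge-in on speakers.
    return TtsPolicy(True, True, True, True)
```

Keeping the policy as data rather than scattered `if mode == ...` checks makes it easier to see that wait mode disables the stop phrase as a side effect of pausing the mic.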
Rationale: without AEC, “interrupt on any speech” is unreliable because the mic hears the assistant audio too. A stop phrase is robust and works everywhere.
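A rolling stop-phrase detector along these lines might look like the following minimal sketch. It assumes the STT layer delivers partial transcripts as strings, and it interprets "requires confirmation" as "a bare stop must be heard in two consecutive partials"; both are assumptions made for illustration, and `StopPhraseDetector` is a hypothetical name.

```python
import re
from collections import deque

STOP_PHRASES = ("ok stop", "okay stop")


def _normalize(text: str) -> str:
    # Lowercase and drop punctuation so "OK, stop!" becomes "ok stop".
    return " ".join(re.sub(r"[^a-z\s]", " ", text.lower()).split())


class StopPhraseDetector:
    """Rolling detector over recent partial transcripts (hypothetical sketch)."""

    def __init__(self, window: int = 3, confirmations: int = 2) -> None:
        self._recent = deque(maxlen=window)   # last few normalized partials
        self._confirmations = confirmations   # assumed rule for bare "stop"
        self._bare_hits = 0

    def reset(self) -> None:
        self._recent.clear()
        self._bare_hits = 0

    def feed(self, partial_transcript: str) -> bool:
        """Return True when playback should be cut."""
        text = _normalize(partial_transcript)
        if not text:
            return False
        self._recent.append(text)
        # Pad with spaces so matching respects word boundaries
        # ("book stopped" must not match "ok stop").
        joined = " " + " ".join(self._recent) + " "
        if any(f" {phrase} " in joined for phrase in STOP_PHRASES):
            self.reset()
            return True
        if text == "stop":
            # Bare "stop" is conservative: require repeated hits.
            self._bare_hits += 1
            if self._bare_hits >= self._confirmations:
                self.reset()
                return True
        else:
            self._bare_hits = 0
        return False
```

Joining the last few partials lets the detector catch a phrase that the recognizer splits across two results ("okay" followed by "stop").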
Phase 2 (optional advanced barge-in)
Add optional barge-in via AEC as an extra (opt-in dependency group), if and only if:
- it is MIT/Apache/BSD licensed
- it has reliable wheels for the major platforms, or we accept it as “advanced”
Candidate packages to re-evaluate:
- WebRTC audio processing AEC bindings (PyPI packages emerging in 2025)
- SpeexDSP AEC bindings
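One way to keep AEC strictly opt-in is to probe for an installed backend at startup and fall back to a non-self-interrupting mode when none is found. The sketch below uses only the standard library; the module names in `AEC_CANDIDATES` are placeholders (no specific package has been selected), and choosing full mode whenever a backend is present is an assumption, not a decision recorded in this ADR.

```python
import importlib.util
from typing import Optional, Sequence

# Placeholder top-level module names; the real candidates are still to be evaluated.
AEC_CANDIDATES = ("example_webrtc_aec", "example_speexdsp_aec")


def find_aec_backend(candidates: Sequence[str] = AEC_CANDIDATES) -> Optional[str]:
    """Return the first importable candidate module name, or None."""
    for name in candidates:
        # find_spec returns None for a top-level module that is not installed,
        # without importing (and possibly failing inside) the module itself.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def default_voice_mode(aec_backend: Optional[str]) -> str:
    """Assumed policy: barge-in by speech only when an AEC backend is present."""
    return "full" if aec_backend else "stop"
```

With this shape, the default install never imports an AEC package, so a missing or broken wheel cannot affect users who did not opt in.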
Best simulation when AEC is not available
If we cannot reliably deploy AEC out-of-box, the best “natural” simulation is:
- default to a non-self-interrupting mode (wait/stop, depending on UX needs)
- allow instant user control via:
  - REPL /stop, /pause, /resume
  - optional keyboard push-to-talk / push-to-stop
In the future, and only if we can do it robustly, we may add an “echo-aware VAD” heuristic (correlation against recently played audio) to approximate barge-in without heavy dependencies.
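To make the "correlation against recently played audio" idea concrete, here is a minimal pure-Python sketch. It compares a mic frame against the playback buffer at several candidate delays and treats a high normalized correlation as "probably echo". The function names, the lag range, and the 0.7 threshold are all illustrative assumptions; a real echo path also includes room reverb and speaker distortion, which this sketch does not model.

```python
import math
from typing import Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length frames (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    energy_a = sum(x * x for x in a)
    energy_b = sum(y * y for y in b)
    if energy_a == 0.0 or energy_b == 0.0:
        return 0.0
    return dot / math.sqrt(energy_a * energy_b)


def max_echo_similarity(
    mic_frame: Sequence[float], played: Sequence[float], max_lag: int
) -> float:
    """Best |correlation| between the mic frame and playback delayed by 0..max_lag samples.

    `played` is the recent playback buffer, ending at the same instant as `mic_frame`.
    """
    n = len(mic_frame)
    best = 0.0
    for lag in range(max_lag + 1):
        end = len(played) - lag
        start = end - n
        if start < 0:
            break  # not enough playback history for this delay
        score = abs(normalized_correlation(mic_frame, played[start:end]))
        best = max(best, score)
    return best


def looks_like_echo(
    mic_frame: Sequence[float],
    played: Sequence[float],
    max_lag: int = 160,
    threshold: float = 0.7,
) -> bool:
    """Heuristic: treat the frame as echo if it closely matches recent playback."""
    return max_echo_similarity(mic_frame, played, max_lag) >= threshold
```

A VAD gate could then ignore speech detections for frames where `looks_like_echo` returns True. Normalizing by energy makes the score insensitive to speaker volume, which matters because the echo is usually much quieter than the played signal.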