ADR 0003 - Reference Text Fallback

ADR 0003: Voice cloning reference text (automatic fallback strategy)

Date: 2026-01-23
Status: Accepted (amended 2026-01-28)

Our cloning engines condition on a reference transcript + the generation text:

f5_tts: ref_text + gen_text
chroma: prompt_text (reference transcript) + prompt_audio (reference audio)

This means reference_text is not “metadata”: it directly influences generation.

If ref_text is wrong (e.g. noisy ASR), the model can:

We want a high-quality, automatic experience even when the user does not provide reference_text.

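When reference_text is missing, we transcribe the reference audio with STT, persist that best-effort transcript once, and let the user correct it afterwards.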

This avoids repeating the “poisoned prompt” problem across utterances.

Implementation note:

We do this lazily at first speak (not at clone-time) so cloning remains offline-first and does not force STT downloads during /clone ... flows.
This fallback is engine-agnostic and must behave the same for f5_tts and chroma.
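
A minimal sketch of the lazy, persist-once resolution described above. The VoiceProfile type and the transcribe and save callables are hypothetical names introduced for illustration, not the project's actual API; the point is only that STT runs at most once per voice and that both engines can share the same code path.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceProfile:
    """Hypothetical stored voice; field names are illustrative only."""
    audio_path: str
    reference_text: Optional[str] = None


def resolve_reference_text(
    profile: VoiceProfile,
    transcribe: Callable[[str], str],      # assumed STT hook: audio path -> text
    save: Callable[[VoiceProfile], None],  # assumed persistence hook
) -> str:
    """Return the stored transcript, transcribing and persisting it once if missing."""
    if profile.reference_text:
        # Already provided by the user or cached by an earlier speak call.
        return profile.reference_text
    text = transcribe(profile.audio_path).strip()
    if not text:
        raise ValueError("STT produced no usable text; set reference_text manually")
    profile.reference_text = text
    save(profile)  # persist once so later utterances never re-transcribe
    return text
```

Calling resolve_reference_text from the speak path rather than the clone path keeps cloning free of STT work; a second call for the same profile returns the cached text without invoking transcribe again.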

To reduce occasional ASR instability, we run STT 3 times and select a consensus transcript:

Normalize each candidate (whitespace + sentence-ending punctuation).
If a majority exists (2/3 match after normalization), choose it.
Otherwise, choose the “closest” candidate (minimum total edit distance to the others).
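
The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: normalize only collapses whitespace and strips trailing sentence-ending punctuation, the edit distance is plain character-level Levenshtein, and the function names are hypothetical. The sketch returns the normalized form of the winning candidate.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and strip trailing sentence-ending punctuation (assumed rule)."""
    collapsed = " ".join(text.split())
    return re.sub(r"[.!?]+$", "", collapsed).strip()


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def pick_consensus(candidates: list[str]) -> str:
    """Pick the majority transcript, else the one closest to all the others."""
    usable = [n for n in (normalize(c) for c in candidates) if n]
    if not usable:
        raise ValueError("STT produced no usable text; set reference_text manually")
    text, count = Counter(usable).most_common(1)[0]
    if count * 2 > len(candidates):  # strict majority, e.g. 2 of 3
        return text
    # No majority: minimize the total edit distance to the other candidates.
    return min(usable, key=lambda c: sum(edit_distance(c, o) for o in usable))
```

For example, pick_consensus(["Hello  world.", "Hello world", "Hello word."]) returns "Hello world", because two of the three candidates agree after normalization.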

If STT produces no usable text, we fail with a clear error and require the user to set reference_text manually.

If we need better robustness than single-pass ASR:

Generate several candidate transcripts (different STT models/decoding/normalizations).
Synthesize a short probe phrase for each candidate.
Select the candidate that maximizes speaker similarity vs reference audio (speaker embeddings).

This optimizes the true objective (speaker similarity) rather than ASR accuracy alone.
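
A rough sketch of that selection loop, assuming cosine similarity as the speaker-similarity measure (the ADR does not name one). The synthesize_probe and embed_speaker callables are hypothetical placeholders for whichever TTS engine and speaker-embedding model end up being used.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embeddings; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_by_speaker_similarity(
    candidates: Sequence[str],
    reference_embedding: Vector,
    synthesize_probe: Callable[[str], bytes],    # assumed: transcript -> probe audio
    embed_speaker: Callable[[bytes], Vector],    # assumed: audio -> speaker embedding
) -> str:
    """Return the candidate transcript whose probe audio best matches the reference speaker."""
    if not candidates:
        raise ValueError("no candidate transcripts to score")

    def score(candidate: str) -> float:
        probe_audio = synthesize_probe(candidate)  # one short synthesis per candidate
        return cosine_similarity(embed_speaker(probe_audio), reference_embedding)

    return max(candidates, key=score)
```

Each candidate costs one probe synthesis plus one embedding pass, so this is noticeably more expensive than the text-only consensus and is better suited to a one-time preprocessing step than to every utterance.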

We accept that fully automatic fallback is probabilistic, but we constrain the blast radius by persisting a best-effort transcript once and enabling easy correction.
We prefer deterministic behavior (one-time preprocessing + caching) over silent repeated re-transcription.