
Voice Cloning (Jan 2026) — Feasibility & Options

Goal

Add near‑realtime voice cloning while respecting:

  • Permissive licensing (MIT/Apache 2.0 for code + model assets)
  • Simple installs for the base package (pip install abstractvoice)

In practice, this almost certainly means voice cloning is an optional extra (GPU/torch-heavy), while the base install stays lightweight and cross‑platform.

Current status (implementation)

This document started as a feasibility note (Jan 2026). The current status below reflects the shipped implementation as of v0.8.x.

  • abstractvoice[cloning] → f5_tts engine (OpenF5 artifacts)
  • abstractvoice[chroma] → chroma engine (Chroma-4B; GPU-heavy)
  • abstractvoice[audiodit] → audiodit engine (LongCat-AudioDiT-1B; prompt-audio cloning)
  • abstractvoice[omnivoice] → omnivoice engine (OmniVoice; omnilingual; prompt-audio cloning + voice design)
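
Each extra pulls in its own heavy runtime, so callers may want to probe which engines are actually importable before exposing cloning features. A minimal gating sketch, assuming a hypothetical abstractvoice.engines.<name> submodule layout; only the engine names come from the extras list above:

```python
# Minimal availability-gating sketch. The submodule layout is hypothetical;
# only the engine names come from the extras list above.
from importlib import util

OPTIONAL_ENGINES = ("f5_tts", "chroma", "audiodit", "omnivoice")

def available_engines():
    """Return the cloning engines whose optional dependencies are installed."""
    found = []
    for name in OPTIONAL_ENGINES:
        try:
            # Hypothetical layout: abstractvoice.engines.<engine>
            if util.find_spec(f"abstractvoice.engines.{name}") is not None:
                found.append(name)
        except ModuleNotFoundError:
            pass  # parent package or extra not installed at all
    return found
```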

Operational notes:

  • Cloned voices are engine-bound (f5_tts vs chroma vs audiodit vs omnivoice); selecting a cloned voice uses its stored engine.
  • The REPL is offline-first; downloads are explicit via:
      ◦ python -m abstractvoice download --openf5|--chroma|--audiodit|--omnivoice
      ◦ REPL: /cloning_download f5_tts|chroma|audiodit|omnivoice
  • reference_text is optional:
      ◦ If missing, AbstractVoice auto-generates it via STT (3-pass consensus) on first speak and persists it for the voice.
      ◦ In offline-first contexts, this requires a cached STT model (prefetch: python -m abstractvoice download --stt small).
  • For best quality, use a clean reference sample and (optionally) set an accurate reference_text:
      ◦ f5_tts: 6–10s often works well.
      ◦ chroma: upstream examples use longer prompts (≈ 15–30s), and quality may degrade with very short prompts.
      ◦ audiodit: 6–15s of prompt audio is a reasonable starting range; prompt length is capped by default.
      ◦ A bad transcript can noticeably degrade cloning quality.
  • The Chroma backend normalizes prompt audio to mono 24kHz PCM16 and caps extreme prompt lengths for stability; a normalization sketch follows this list.
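
For illustration, the normalization described in the last bullet amounts to roughly the following, sketched with librosa and soundfile rather than AbstractVoice internals; the function name and the 30-second cap are assumptions:

```python
# Sketch of prompt-audio normalization as described above: mono, 24 kHz,
# 16-bit PCM, with a length cap. Uses librosa + soundfile, not AbstractVoice code.
import librosa
import soundfile as sf

def normalize_prompt(in_path, out_path, target_sr=24_000, max_seconds=30.0):
    # librosa resamples to target_sr and downmixes to mono in one call.
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    audio = audio[: int(max_seconds * target_sr)]  # cap extreme prompt lengths
    sf.write(out_path, audio, target_sr, subtype="PCM_16")  # 16-bit PCM output
```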

What “voice cloning” can mean

  • (A) Multi-speaker / voice prompting: provide a voice prompt or speaker embedding; the model synthesizes in that style.
  • (B) Voice conversion: convert one speaker’s audio into another speaker’s timbre.
  • (C) Fine-tuning: train/finetune to a specific voice (slow, not “instant”).

For “clone in seconds”, we want (A) and/or (B), not (C).


Candidates (permissive license signals)

1) VoxCPM (OpenBMB) — Apache 2.0

  • Claim: zero-shot voice cloning from short reference audio; streaming inference.
  • Perf: reported real-time factor (RTF) ~0.17 on RTX 4090 (good for realtime).
  • Install: heavy (PyTorch + GPU recommended); likely optional extra.
  • Languages: training described as Chinese/English; may not meet full multilingual requirements (FR/DE/ES/RU).

2) Dia (Nari Labs) — Apache 2.0

  • Claim: zero-shot voice cloning from seconds of reference audio; expressive dialogue.
  • Perf: “realtime” on GPU; reported ~10GB VRAM requirement.
  • Install: heavy; optional extra.
  • Languages: unclear breadth; likely not full multilingual coverage.

3) MetaVoice‑1B — Apache 2.0

  • Claim: zero-shot cloning for some English accents; finetuning for others.
  • Install: heavy; optional extra.
  • Languages: English focus.

4) VibeVoice‑Realtime‑0.5B (Microsoft) — MIT

  • Strength: realtime/low-latency TTS.
  • Gap: not clearly a “clone any voice from reference audio” solution (depends on model variant).

5) OpenVoice (MyShell)

  • License ambiguity: repository is commonly labeled MIT, but there are credible reports/issues indicating non‑commercial restrictions.
  • Conclusion: do not adopt unless we can unambiguously confirm permissive terms for both code and weights.

6) XTTS‑v2 (Coqui)

  • License: CPML (non‑permissive / non‑commercial) → rejected for our constraints.

Recommendation (current stance)

Keep the base package clean

  • Keep voice cloning and torch-heavy engines behind optional extras.
  • Require explicit artifact downloads (abstractvoice-prefetch ... or python -m abstractvoice download ...) for offline-first deployments; a provisioning sketch follows this list.
  • Treat each engine as a separate runtime with separate licensing, hardware, and language-quality expectations.
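
As referenced above, deploy-time provisioning can script the documented download commands; a minimal sketch, where the specific artifacts fetched are illustrative:

```python
# Sketch: prefetch engine + STT artifacts at deploy time using the
# documented download CLI. Which artifacts to fetch depends on the target host.
import subprocess
import sys

PREFETCH = [
    [sys.executable, "-m", "abstractvoice", "download", "--openf5"],
    # Cached STT model enables auto-generated reference_text offline (see above).
    [sys.executable, "-m", "abstractvoice", "download", "--stt", "small"],
]

for cmd in PREFETCH:
    subprocess.run(cmd, check=True)  # fail fast if an artifact can't be fetched
```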

Use the common API, but test per engine

The shared user-facing shape is implemented (usage sketch below):

  • VoiceManager.clone_voice(reference_audio, name=..., engine=...) -> voice_id
  • VoiceManager.speak(text, voice=voice_id)
  • VoiceManager.speak_to_bytes(text, voice=voice_id)
  • export_voice() / import_voice() for portability
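
A short end-to-end sketch against this shape; the import path, sample filenames, and the export_voice argument shape are assumptions:

```python
# End-to-end usage sketch of the shared API. The import path and filenames
# are assumptions; the method names/signatures come from the list above.
from abstractvoice import VoiceManager

vm = VoiceManager()

# Clone from a short, clean reference sample; the returned voice_id is
# bound to the engine it was cloned with.
voice_id = vm.clone_voice("ref/alice_8s.wav", name="alice", engine="f5_tts")

# Synthesis routes through the voice's stored engine automatically.
vm.speak("Hello from the cloned voice.", voice=voice_id)
wav_bytes = vm.speak_to_bytes("Rendered to bytes for downstream use.", voice=voice_id)

# Portability between installs; this argument shape is an assumption.
vm.export_voice(voice_id, "alice.voice")
```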

Quality, latency, and language coverage still vary by engine and hardware. Validate the exact model/version you plan to ship.


Ongoing evaluation questions

  • Which candidate covers our target languages (EN/FR/DE/ES/RU/ZH) without restrictive licensing?
  • Can we support CPU-only “near realtime”, or is GPU a hard requirement?
  • Can we standardize “reference audio format” and required length?