API

This is the supported integrator contract for AbstractVoice.

Start with README.md and docs/getting-started.md for setup. Use docs/faq.md for cache/history reset and troubleshooting, docs/repl_guide.md for the interactive REPL, and docs/architecture.md for implementation details.

Implementation map:

  • abstractvoice/voice_manager.py → abstractvoice/vm/manager.py (constructor + wiring)
  • abstractvoice/vm/tts_mixin.py (TTS + cloning methods)
  • abstractvoice/vm/stt_mixin.py (STT + listening methods)
  • abstractvoice/vm/core.py (voice-mode behavior during playback)

Primary entry point

  • abstractvoice.VoiceManager
  • abstractvoice.VoiceProfile (data type; used by the voice-profile APIs)
from abstractvoice import VoiceManager

vm = VoiceManager(language="en", remote_api_key="sk-...", allow_downloads=True)

Constructor (most-used knobs)

The source of truth is abstractvoice/vm/manager.py:

VoiceManager(
    language: str = "en",
    tts_model: str | None = None,
    whisper_model: str = "base",
    debug_mode: bool = False,
    tts_engine: str = "openai",
    stt_engine: str = "openai",
    allow_downloads: bool = True,
    cloned_tts_streaming: bool = True,
    cloning_engine: str = "f5_tts",
    tts_delivery_mode: str | None = None,  # buffered|streamed (override)
    stt_model: str | None = None,
    remote_base_url: str | None = None,
    remote_api_key: str | None = None,
    remote_timeout_s: float | None = None,
)

Notes:

  • VoiceManager() and auto are remote-first. Hosted OpenAI audio requires OPENAI_API_KEY or remote_api_key=....
  • allow_downloads gates implicit local model downloads in adapters. The REPL sets False (offline-first).
  • whisper_model controls the faster-whisper model size used by local listen() / transcribe_*() paths.
  • tts_engine supports:

    • openai (default; remote OpenAI /v1/audio/speech; requires OPENAI_API_KEY)
    • auto (deterministic default: resolves to openai)
    • piper (local TTS; requires abstractvoice[local] or abstractvoice[piper])
    • openai-compatible (remote compatible /v1/audio/speech; configure remote_base_url or ABSTRACTVOICE_REMOTE_BASE_URL)
    • audiodit (LongCat-AudioDiT; requires abstractvoice[audiodit]; upstream focuses on EN/ZH; direct/base TTS has a known quality caveat in 0.8.1)
    • omnivoice (OmniVoice; requires abstractvoice[omnivoice]; upstream supports 600+ languages)
  • stt_engine supports openai|auto|faster_whisper|openai-compatible. auto resolves to openai. The local faster-whisper path requires abstractvoice[local] or abstractvoice[stt]. Missing credentials or missing explicit local dependencies raise actionable errors; the legacy OpenAI Whisper fallback was removed.
  • tts_model is reserved/back-compat for local Piper (selection is language-driven today); for remote TTS it maps to the request model.
  • For remote STT, stt_model maps to the transcription model.
  • Remote configuration can be passed in the constructor or via env vars (see the sketch below):

    • OpenAI: OPENAI_API_KEY, optional ABSTRACTVOICE_OPENAI_TTS_MODEL, ABSTRACTVOICE_OPENAI_STT_MODEL
    • Compatible endpoints: ABSTRACTVOICE_REMOTE_BASE_URL, optional ABSTRACTVOICE_REMOTE_API_KEY, ABSTRACTVOICE_REMOTE_TTS_MODEL, ABSTRACTVOICE_REMOTE_STT_MODEL
    • OpenAI-compatible profile discovery: GET /audio/voices is tried by default for compatible endpoints; override with ABSTRACTVOICE_REMOTE_VOICE_PROFILE_PATH or ABSTRACTVOICE_REMOTE_VOICE_PROFILE_PATHS. Static voice/profile ids can also be supplied with ABSTRACTVOICE_REMOTE_TTS_VOICES.
  • tts_delivery_mode is an optional override that applies consistently to both base TTS and cloned voices:

    • buffered: synthesize full audio first (one payload)
    • streamed: deliver audio in chunks when available (lower time-to-first-audio)

Supported language codes for the Piper mapping: en, fr, de, es, ru, zh (see abstractvoice/config/voice_catalog.py and abstractvoice/adapters/tts_piper.py). For non-Piper engines (e.g. OmniVoice or remote OpenAI-compatible engines), language is treated as a pass-through hint and the engine decides what it supports.
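
The sketch referenced above: a minimal constructor call pointing both TTS and STT at an OpenAI-compatible endpoint. The base URL and key are placeholders.

from abstractvoice import VoiceManager

# Remote OpenAI-compatible audio; URL and key are placeholders.
vm = VoiceManager(
    language="en",
    tts_engine="openai-compatible",
    stt_engine="openai-compatible",
    remote_base_url="http://localhost:8000/v1",
    remote_api_key="sk-...",
    remote_timeout_s=30.0,
)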

TTS (text → audio)

  • speak(text: str, speed: float = 1.0, callback=None, voice: str | None = None, *, sanitize_syntax: bool = True) -> bool
  • Plays audio locally (non-blocking playback; synthesis time depends on backend).
  • If voice is provided, it is treated as a cloned voice_id (requires abstractvoice[cloning]).
  • By default, common Markdown syntax is stripped from spoken output (headers + emphasis). Set sanitize_syntax=False to speak raw text.
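
A minimal sketch of the default speaking path; with sanitize_syntax left at its default, the emphasis markers below are stripped before playback:

# Non-blocking playback; the optional callback fires after playback drains.
ok = vm.speak("Hello **world**", speed=1.1, callback=lambda: print("done"))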

  • set_speed(speed: float) -> bool, get_speed() -> float

  • Adjusts the default speaking speed used by speak_to_*() and the REPL.

  • set_tts_quality_preset(preset: str) -> bool, get_tts_quality_preset() -> str | None

  • Engine-agnostic speed/quality knob (low|standard|high). Back-compat aliases: fast → low, balanced → standard.
  • Engines that don’t support quality tuning may return False / None (Piper is typically a no-op).
  • For AudioDiT this primarily maps to diffusion steps (and a small guidance-strength tweak).

  • get_profiles(*, kind: str = "tts") -> list[VoiceProfile]

  • set_profile(profile_id: str, *, kind: str = "tts") -> bool
  • get_active_profile(*, kind: str = "tts") -> VoiceProfile | None
  • Cross-engine voice profile abstraction (preset packs).
  • Profiles are engine-local: you select tts_engine first, then apply a profile id for that engine.
  • Engines without profiles return an empty list / False / None.
  • Concurrency note: profile selection mutates engine state. For servers, prefer one VoiceManager per session (or guard profile changes with a lock).
  • Remote OpenAI note: hosted built-in voices are always exposed as profiles (for example vm.set_profile("alloy")), and the adapter also tries OpenAI voice discovery for account/org-specific voices such as voice_.... tts_engine="openai" defaults to https://api.openai.com/v1 and reads OPENAI_API_KEY.
  • Remote compatible note: compatible endpoints may expose GET /v1/audio/voices (adapter path: GET /audio/voices) returning profiles, voices, cloned_voices, or OpenAI-style data. Returned ids are exposed as VoiceProfiles and used as the request voice for /audio/speech.
  • The voice= argument on speak_to_bytes(...) remains the cloned-voice handle path for backward compatibility; select base-provider voices with set_profile(...).
  • OmniVoice notes:

    • Some profiles may enable persistent prompt caching (a tokenized voice_clone_prompt). The first set_profile(...) can pay a one-time build cost; later synthesis reuses cached tokens for stable voice identity. Prompt-conditioned synthesis can be heavier than pure voice design; use /tts quality low|standard|high (or VoiceManager.set_tts_quality_preset(...)) to tune the trade-off.
    • On macOS / Apple Silicon, OmniVoice uses MPS (Metal) by default when device="auto".
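
A sketch of engine-local profile selection; "alloy" assumes the hosted OpenAI engine, and other engines expose different profile ids:

profiles = vm.get_profiles(kind="tts")       # [] for engines without profiles
if vm.set_profile("alloy", kind="tts"):
    print(vm.get_active_profile(kind="tts"))
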
  • pause_speaking() -> bool, resume_speaking() -> bool, stop_speaking() -> bool

  • Playback control.

  • is_speaking() -> bool, is_paused() -> bool

  • Playback state helpers.

  • set_tts_delivery_mode(mode: str | None) -> bool, get_tts_delivery_mode() -> str, get_tts_delivery_modes() -> dict

  • Toggle buffered vs streamed delivery (applies to both base TTS and cloned voices).
  • Behavior note: streamed delivery is implemented as a pipeline:

    • text is chunked into short segments (sentence-first);
    • each segment is synthesized and enqueued as soon as possible;
    • engines that can stream audio natively may further reduce TTFB by yielding multiple audio chunks per segment.
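
A sketch of switching delivery modes at runtime; the shape of the modes dict is engine-dependent:

vm.set_tts_delivery_mode("streamed")   # lower time-to-first-audio
print(vm.get_tts_delivery_mode())      # "streamed"
print(vm.get_tts_delivery_modes())     # available modes (engine-dependent dict)
vm.set_tts_delivery_mode(None)         # clear the override (assumed to restore defaults)
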
  • speak_to_bytes(text: str, format: str = "wav", voice: str | None = None, *, sanitize_syntax: bool = True) -> bytes

  • Headless/server‑friendly: returns encoded audio bytes.

  • speak_to_audio_chunks(text: str, *, voice: str | None = None, sanitize_syntax: bool = True) -> Iterator[tuple[np.ndarray, int]]

  • Headless/server‑friendly: yields (audio_chunk, sample_rate) tuples for incremental delivery.
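
A sketch of incremental delivery; send_chunk is a hypothetical transport function:

# Each item is a (numpy array, sample rate) pair, yielded as synthesized.
for chunk, sample_rate in vm.speak_to_audio_chunks("A longer passage to stream."):
    send_chunk(chunk, sample_rate)  # hypothetical: forward to a client/socket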

  • open_tts_text_stream(*, voice: str | None = None, callback=None, sanitize_syntax: bool = True, max_chars: int | None = None, min_chars: int | None = None) -> TextToSpeechStream

  • Push-based streaming bridge for LLM streaming → TTS streaming pipelining.
  • Returned object supports: .push(delta), .close(), .cancel(), .join(timeout=...).
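
A sketch of the LLM-to-TTS bridge; llm_deltas stands in for any iterator of streamed text deltas:

stream = vm.open_tts_text_stream(sanitize_syntax=True)
try:
    for delta in llm_deltas:      # hypothetical LLM token/delta iterator
        stream.push(delta)
    stream.close()                # flush any buffered tail text
    stream.join(timeout=60.0)     # wait for synthesis/playback to drain
except KeyboardInterrupt:
    stream.cancel()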

  • speak_to_file(text: str, output_path: str, format: str | None = None, voice: str | None = None, *, sanitize_syntax: bool = True) -> str

  • Writes an audio file and returns the path.

Language & voice selection (Piper path)

  • set_language(language: str) -> bool
  • Switches the active language.
  • For explicit Piper, validation uses the curated Piper mapping in abstractvoice/config/voice_catalog.py.
  • For non-Piper engines such as OmniVoice, the language code is passed through to the adapter and the engine decides what it supports.
  • If microphone listening is active, the recognizer is recreated on the next listen(...) call so STT receives the updated language.

  • get_language() -> str, get_language_name(language_code: str | None = None) -> str

  • get_supported_languages() -> list[str]

  • list_available_models(language: str | None = None) -> dict

  • Lists voice/model catalog entries for CLI/web display (see abstractvoice/vm/tts_mixin.py).
  • Piper returns local voice cache status by language.
  • OpenAI/OpenAI-compatible TTS returns remote voice profiles plus configured/discovered TTS model ids when the active adapter supports model listing.
  • Back-compat alias: list_voices().

  • set_voice(language: str, voice_id: str) -> bool

  • Backward-compatible method; Piper voice selection is currently best-effort.

STT (audio → text)

  • transcribe_file(audio_path: str, language: str | None = None) -> str
  • Transcribes audio from a file.

  • transcribe_from_bytes(audio_bytes: bytes, language: str | None = None) -> str

  • Transcribes audio sent over the network.
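
A minimal sketch of both transcription paths; meeting.wav is a placeholder file:

text = vm.transcribe_file("meeting.wav", language="en")

with open("meeting.wav", "rb") as f:
    text = vm.transcribe_from_bytes(f.read(), language="en")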

STT configuration

  • set_whisper(model_name: str) -> None | bool
  • Updates the faster‑whisper model size used for subsequent operations.

  • get_whisper() -> str

Microphone capture (local assistant mode)

  • listen(on_transcription, on_stop=None) -> bool
  • Starts microphone capture + VAD + STT in-process (abstractvoice/recognition.py).
  • Stop phrase(s): "ok stop", "okay stop", and (conservatively) "stop"; see abstractvoice/recognition.py and abstractvoice/stop_phrase.py.
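
A sketch of a short capture session, assuming the callback receives the transcribed text:

import time

def on_transcription(text):
    print("heard:", text)

vm.listen(on_transcription)
time.sleep(10)        # capture window; or poll vm.is_listening()
vm.stop_listening()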

  • stop_listening() -> bool

  • Stops microphone capture.

  • pause_listening() -> bool, resume_listening() -> bool

  • Pauses/resumes audio processing while keeping the listening thread alive.

  • is_listening() -> bool

  • Whether the background recognizer thread is running.

  • cleanup() -> bool

  • Best-effort cleanup for long-lived apps (stop listening, stop speaking, release audio resources).

Advanced tuning (best-effort)

  • change_vad_aggressiveness(aggressiveness: int) -> bool
  • For advanced mic/VAD tuning; see abstractvoice/recognition.py.

Voice modes (behavior while speaking)

Voice modes control what the microphone loop does while TTS is playing. Set via:

  • set_voice_mode(mode: str) -> bool where mode ∈ {"full","wait","stop","ptt"}

Mode semantics (implemented in abstractvoice/vm/core.py):

  • full: keep listening and allow barge‑in (interrupt TTS on detected speech). Best with AEC or headset; speakers can cause self-interruption (mitigations exist; see echo gating in abstractvoice/recognition.py).
  • wait: pause microphone processing while speaking. No barge‑in and no stop‑phrase detection during TTS. Good for strict turn‑taking.
  • stop: keep listening, but suppress normal transcriptions during TTS and disable “interrupt on any speech”; a rolling stop‑phrase detector stays active so users can say “ok stop” to cut playback.
  • ptt: push‑to‑talk profile (thresholds tuned for short utterances). During TTS it behaves like stop mode; the integrator controls when to start/stop capture.

The REPL defaults to mic input off, and recommends --voice-mode stop for hands‑free usage; see docs/repl_guide.md.
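
A minimal sketch of the hands-free configuration:

vm.set_voice_mode("stop")  # keep listening; only the stop phrase cuts playback
vm.speak("You can interrupt this sentence by saying: ok stop.")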

Acoustic echo cancellation (optional)

  • enable_aec(enabled: bool = True, stream_delay_ms: int = 0) -> bool
  • Opt‑in AEC support for true barge‑in (requires abstractvoice[aec]).
  • Playback audio chunks are fed to the recognizer via abstractvoice/vm/core.py → VoiceRecognizer.feed_far_end_audio() in abstractvoice/recognition.py.

Voice cloning (optional; heavy)

Requires installing at least one cloning backend extra (and explicit artifact downloads; see docs/installation.md):

  • abstractvoice[cloning] → f5_tts
  • abstractvoice[chroma] → chroma
  • abstractvoice[audiodit] → audiodit
  • abstractvoice[omnivoice] → omnivoice

Remote clone-compatible endpoints can also be used without local cloning model weights by selecting cloning_engine="openai-compatible" (or engine="openai-compatible" per call). Configure remote_base_url or ABSTRACTVOICE_REMOTE_BASE_URL; the default clone endpoint is POST /voice/clone and must return a remote voice id (voice_id or id). The local clone store keeps a handle and routes later speak_to_bytes(..., voice=<local_id>) calls to remote /audio/speech with that remote voice id.

cloning_engine="openai" targets OpenAI's hosted API by default. Custom voice creation is provider/org gated and requires explicit consent configuration such as ABSTRACTVOICE_OPENAI_VOICE_CONSENT_ID; otherwise the adapter raises an actionable error instead of silently pretending cloning is standardized.

Core cloning calls:

  • clone_voice(reference_audio_path: str, name: str | None = None, *, reference_text: str | None = None, engine: str | None = None) -> str
  • clone_voice_from_wav_bytes(wav_bytes: bytes, name: str | None = None, *, reference_text: str | None = None, engine: str | None = None) -> str
  • speak(..., voice="<voice_id>") / speak_to_bytes(..., voice="<voice_id>") / speak_to_file(..., voice="<voice_id>")
  • list_cloned_voices(), get_cloned_voice(voice_id: str) -> dict
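
A sketch of the end-to-end cloning flow, assuming a cloning extra is installed and reference.wav is a clean local clip:

# Register a reference clip; the transcript improves cloning quality.
voice_id = vm.clone_voice(
    "reference.wav",
    name="narrator",
    reference_text="Exact transcript of the reference audio.",
)
wav_bytes = vm.speak_to_bytes("Hello in the cloned voice.", voice=voice_id)
print(vm.list_cloned_voices())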

Clone management helpers:

  • set_cloned_voice_reference_text(voice_id: str, reference_text: str) -> bool
  • rename_cloned_voice(voice_id: str, new_name: str) -> bool
  • delete_cloned_voice(voice_id: str) -> bool
  • export_voice(voice_id: str, path: str) -> str, import_voice(path: str) -> str
  • set_cloned_tts_quality(preset: str) -> bool (low|standard|high; aliases: fast, balanced)
  • get_cloning_runtime_info() -> dict
  • unload_cloning_engines(*, keep_engine: str | None = None) -> int (best-effort memory relief)
  • unload_piper_voice() -> bool (best-effort memory relief)

For the user-facing workflow and commands, see docs/repl_guide.md.

Engine caveats that affect release choice are tracked in docs/known-issues.md.

Metrics (optional)

  • pop_last_tts_metrics() -> dict | None
  • Best-effort last-utterance stats used by the REPL verbose mode.

Callbacks & hooks

  • Per-utterance callback: speak(..., callback=...) (invoked after playback drains).
  • TTS lifecycle callbacks: vm.tts_engine.on_playback_start / vm.tts_engine.on_playback_end (synthesis/queue lifecycle).
  • Audio lifecycle callbacks (actual output): vm.on_audio_start / vm.on_audio_end / vm.on_audio_pause / vm.on_audio_resume (wired in abstractvoice/vm/core.py).
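
A sketch of wiring the hooks, assuming zero-argument callables:

vm.on_audio_start = lambda: print("audio out: started")
vm.on_audio_end = lambda: print("audio out: ended")
vm.tts_engine.on_playback_start = lambda: print("queue: started")
vm.tts_engine.on_playback_end = lambda: print("queue: ended")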

Explicit downloads (offline-first)

For offline deployments, prefetch explicitly (cross-platform):

python -m abstractvoice download --stt small
python -m abstractvoice download --piper en
python -m abstractvoice download --openf5   # optional; requires abstractvoice[cloning]
python -m abstractvoice download --chroma   # optional; requires abstractvoice[chroma] (GPU-heavy)
python -m abstractvoice download --audiodit # optional; requires abstractvoice[audiodit]
python -m abstractvoice download --omnivoice # optional; requires abstractvoice[omnivoice]

Or use the convenience entrypoint:

abstractvoice-prefetch --stt small
abstractvoice-prefetch --piper en
abstractvoice-prefetch --openf5            # optional; requires abstractvoice[cloning]
abstractvoice-prefetch --chroma            # optional; requires abstractvoice[chroma] (GPU-heavy)
abstractvoice-prefetch --audiodit          # optional; requires abstractvoice[audiodit]
abstractvoice-prefetch --omnivoice         # optional; requires abstractvoice[omnivoice]

Note: --chroma artifacts may require Hugging Face access to download.

See also: docs/installation.md, docs/model-management.md, and docs/voices-and-licenses.md.

Performance note: prefetch vs preload (important for servers)

  • Prefetch (download to disk): python -m abstractvoice download ... / abstractvoice-prefetch ...
  • Preload (load into memory): create a long-lived VoiceManager (or adapter) and reuse it.

If you construct a new VoiceManager for every request, heavy engines (AudioDiT/OmniVoice) repeatedly pay what should be a one-time startup cost (imports + weight load + accelerator kernel compilation).

Recommended pattern (server/process startup):

from abstractvoice import VoiceManager

# Load once, reuse for all requests.
vm = VoiceManager(
    language="en",
    tts_engine="omnivoice",
    stt_engine="openai",
    remote_api_key="sk-...",
    allow_downloads=False,
)
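
Per-request handlers then call methods on the shared instance (for example vm.speak_to_bytes(...) or vm.transcribe_from_bytes(...)) instead of constructing a new VoiceManager.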

Integrations (AbstractFramework ecosystem)

AbstractVoice is designed to work standalone, and also integrate cleanly into the AbstractFramework ecosystem (AbstractCore + AbstractRuntime). Overview and links: README.md.

Boundary note:

  • AbstractVoice owns the in-process voice backend (VoiceManager, adapters, model/cache policy).
  • AbstractCore owns agent orchestration, provider routing, capability selection, and OpenAI-compatible HTTP endpoints.
  • When both are installed, AbstractCore can expose AbstractVoice-backed audio endpoints such as POST /v1/audio/speech and POST /v1/audio/transcriptions.

AbstractCore capability plugin (auto-discovery)

AbstractVoice exposes an AbstractCore capability plugin entry point:

  • Entry point declaration: pyproject.toml → [project.entry-points."abstractcore.capabilities_plugins"]
  • Implementation: abstractvoice/integrations/abstractcore_plugin.py

The plugin registers:

  • a voice backend (backend_id="abstractvoice:default") for TTS+STT
  • an audio backend (backend_id="abstractvoice:stt") for STT-only

Audio outputs can optionally be stored into an AbstractRuntime-like artifact_store via the duck-typed adapter in abstractvoice/artifacts.py.

The voice backend also exposes thin catalog discovery methods for Core/Gateway integration code:

  • list_profiles(kind="tts") -> list[dict]
  • list_tts_models() -> list[str]
  • voice_catalog() -> {kind, engine_id, active_profile, active_model, profiles, tts_models, catalog}

These methods delegate to the active VoiceManager and keep voice/profile/model semantics in AbstractVoice. AbstractCore still owns HTTP routing, auth, and browser/security policy.

Plugin configuration (owner config dict, best-effort). In AbstractCore integrations, the env/default path uses OpenAI remote TTS/STT (OPENAI_API_KEY, optional ABSTRACTVOICE_OPENAI_* overrides) unless owner config or ABSTRACTVOICE_TTS_ENGINE / ABSTRACTVOICE_STT_ENGINE selects a different engine:

  • voice_language: default language (e.g. "en")
  • voice_allow_downloads: allow on-demand downloads (bool)
  • voice_tts_engine: base TTS engine ("auto"|"piper"|"openai"|"openai-compatible"|"audiodit"|"omnivoice")
  • voice_stt_engine: STT engine ("auto"|"faster_whisper"|"openai"|"openai-compatible")
  • voice_tts_model: model id for remote TTS engines
  • voice_stt_model: model id for remote STT engines
  • voice_remote_base_url: base URL for OpenAI-compatible remote audio endpoints
  • voice_remote_api_key: optional bearer key for remote audio endpoints
  • voice_remote_timeout_s: request timeout for remote audio endpoints
  • voice_whisper_model: faster-whisper model size (e.g. "base", "small")
  • voice_cloning_engine: default cloning backend ("f5_tts"|"chroma"|"audiodit"|"omnivoice"|"openai"|"openai-compatible")
  • voice_cloned_tts_streaming: stream cloned-voice chunks for faster time-to-first-audio (bool). Used when voice_tts_delivery_mode is unset.
  • voice_tts_delivery_mode: unified audio delivery mode for base + cloned voices ("buffered"|"streamed"). Takes precedence over voice_cloned_tts_streaming.
  • voice_tts_streaming: bool alias for voice_tts_delivery_mode (true → "streamed", false → "buffered")
  • voice_debug_mode: enable debug prints (bool)

Boolean owner config/env values accept common strings such as true, false, on, off, 1, and 0; string values like "false" are not treated as truthy.
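
A sketch of an owner config dict using the keys above; how the dict reaches the plugin depends on the AbstractCore integration:

voice_config = {
    "voice_language": "en",
    "voice_tts_engine": "openai",
    "voice_stt_engine": "openai",
    "voice_allow_downloads": False,
    "voice_tts_delivery_mode": "streamed",
    "voice_whisper_model": "base",
}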

Performance note: the capability plugin caches VoiceManager instances in-process (keyed by the config above) so engines are not reloaded per request.

TTS metrics: after synthesis, the plugin stores best-effort stats in artifact metadata under abstractvoice_tts (when artifact_store is used).

AbstractCore OpenAI-compatible audio endpoints

The production OpenAI-compatible HTTP server lives in AbstractCore. AbstractVoice also ships a local FastAPI web example (abstractvoice web) for package-level smoke testing, but the supported API server path is AbstractCore Server.

With abstractcore[server] and abstractvoice installed in the same environment, AbstractCore delegates its audio endpoints to the discovered capability plugin:

  • POST /v1/audio/speech -> core.voice.tts(...) -> VoiceManager.speak_to_bytes(...)
  • POST /v1/audio/transcriptions -> core.audio.transcribe(...) -> VoiceManager.transcribe_*()

Example:

OPENAI_API_KEY=... python -m abstractcore.server.app

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello.","format":"wav"}' \
  --output hello.wav

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@hello.wav" \
  -F "language=en"

If AbstractCore Server is configured with ABSTRACTCORE_SERVER_API_KEY, include the standard Authorization: Bearer <key> header. If the plugin is unavailable, AbstractCore returns 501 with install/config guidance instead of silently falling back.

For openai-compatible plugin configuration, do not set voice_remote_base_url to the same AbstractCore Server instance that is currently routing the /v1/audio/* request. That configuration recurses through the plugin path; use an upstream compatible provider/gateway, or select local engines explicitly.

The local abstractvoice web example exposes these smoke-test routes. They are example routes, not a replacement for AbstractCore Server, and they do not inherit AbstractCore/Gateway authentication or browser-origin policy:

abstractvoice web --tts-engine openai --stt-engine openai
abstractvoice web --tts-engine openai-compatible --stt-engine openai-compatible --remote-base-url http://localhost:8000/v1
  • GET /api/status -> lightweight server/config status
  • GET /api/voices -> VoiceManager.get_profiles(), list_available_models(), list_cloned_voices()
  • GET /v1/audio/voices -> compatible extension for remote profile/voice discovery (VoiceManager.get_profiles() + list_cloned_voices())
  • POST /api/voices/select -> select base TTS, a cloned voice, or a TTS profile; optional local role="assistant"|"user" stores browser-example defaults; optional preload=true warms a cloned voice by calling a tiny VoiceManager.speak_to_bytes(...)
  • POST /api/voices/clone -> example-only multipart upload for browser voice cloning; stores the uploaded/recorded reference with VoiceManager.clone_voice(...) and validates by synthesizing a short sample by default (validate=false skips validation)
  • POST /v1/voice/clone -> compatible extension for remote clone creation; returns voice_id/id for later /v1/audio/speech voice
  • POST /api/tts -> VoiceManager.speak_to_bytes(...); accepts input/text, voice, role, language, speed, format/response_format, and sanitize_syntax
  • POST /api/stt/transcriptions -> VoiceManager.transcribe_file(...)
  • POST /api/stt/transcribe -> compatibility alias for /api/stt/transcriptions
  • GET /api/llm/models -> example-only model listing for an OpenAI-compatible local provider such as Ollama
  • POST /api/chat -> example-only non-streaming chat completion proxy; the browser owns history and sends the full short message list
  • POST /v1/audio/speech and POST /v1/audio/transcriptions -> local aliases for quick AbstractCore-compatible smoke tests. In the web example, voice may be either a cloned voice id/name or an active-engine profile id.

Local web example payload sketches:

# Browser-example role default; still resolves to a cloned voice_id.
curl -X POST http://127.0.0.1:5000/api/voices/select \
  -H "Content-Type: application/json" \
  -d '{"role":"assistant","kind":"clone","voice":"my_voice","preload":true}'

# Browser-example cloned voice creation.
curl -X POST http://127.0.0.1:5000/api/voices/clone \
  -F "name=my_voice" \
  -F "engine=f5_tts" \
  -F "reference_text=Exact transcript of the reference audio." \
  -F "file=@reference.wav"

# TTS still maps to VoiceManager.speak_to_bytes(...).
curl -X POST http://127.0.0.1:5000/api/tts \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello.","role":"assistant","response_format":"wav"}' \
  --output hello.wav

# Compatible extension: discover profiles/cloned voices from another
# AbstractVoice client configured with remote_base_url=http://127.0.0.1:5000/v1.
curl http://127.0.0.1:5000/v1/audio/voices

# Compatible extension: create a remote cloned voice handle.
curl -X POST http://127.0.0.1:5000/v1/voice/clone \
  -F "name=my_remote_voice" \
  -F "reference_text=Exact transcript of the reference audio." \
  -F "file=@reference.wav"

# Example dialogue call via a local OpenAI-compatible provider.
curl -X POST http://127.0.0.1:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"provider":"ollama","model":"gemma3:1b","messages":[{"role":"user","content":"Say hi in one sentence."}]}'

The local role field is only a convenience for the example UI. AbstractCore compatibility remains the capability/plugin contract: clients may pass a cloned-voice id in the voice field of a TTS request, and AbstractCore routes it to VoiceManager.speak_to_bytes(..., voice=...).

AbstractCore tool helpers (manual wiring)

If you prefer to wire tools explicitly, abstractvoice/integrations/abstractcore.py provides:

  • make_voice_tools(voice_manager, store) -> list[callable]
  • Requires abstractcore at runtime (it imports abstractcore.tool).
  • store can be a MediaStore-like object, or an AbstractRuntime-like ArtifactStore (adapted via RuntimeArtifactStoreAdapter in abstractvoice/artifacts.py).

Tools exposed by make_voice_tools(...) (current):

  • voice_tts(text, voice=None, format="wav", run_id=None) -> artifact_ref
  • voice_profile_list(kind="tts") -> {profiles, active_profile}
  • voice_profile_set(profile_id, kind="tts") -> {ok, active_profile}
  • audio_transcribe(audio_artifact|audio_b64, ...) -> {text, transcript_artifact}

Minimal sketch:

from abstractvoice import VoiceManager
from abstractvoice.integrations.abstractcore import make_voice_tools

vm = VoiceManager(remote_api_key="sk-...")
tools = make_voice_tools(voice_manager=vm, store=artifact_store)

Example (engine-agnostic profile selection):

vm = VoiceManager(tts_engine="omnivoice", allow_downloads=False)
vm.set_profile("female_01", kind="tts")
wav_bytes = vm.speak_to_bytes("Hello.", format="wav")

TTS metrics (library-level):

  • VoiceManager.speak_to_bytes(...) / VoiceManager.speak_to_file(...) record best-effort stats for the last synthesis.
  • Call vm.pop_last_tts_metrics() to retrieve and clear them (dict with fields like engine, synth_s, audio_s, rtf, sample_rate).
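
A minimal sketch; the exact metric fields vary by engine:

wav_bytes = vm.speak_to_bytes("Metrics demo.", format="wav")
stats = vm.pop_last_tts_metrics()
if stats:
    print(stats.get("engine"), stats.get("synth_s"), stats.get("rtf"))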

Non-contract surface (may change without notice)

  • CLI behavior (abstractvoice/examples/*)
  • Internal adapter details and model catalogs beyond the documented defaults