
Providers

Every STT, LLM, and TTS provider — supported models, transport, features, and configuration.

CompositeVoice uses three provider slots — STT (speech-to-text), LLM (large language model), and TTS (text-to-speech). Mix and match any combination to build your voice pipeline.
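The three slots can be pictured as interfaces wired in sequence. The interface and helper names below are illustrative only (they are not CompositeVoice's actual types); the sketch just shows the data flow: audio in, transcript, reply, audio out.

```typescript
// Illustrative slot interfaces: assumptions for this sketch, not the
// library's real type definitions.
interface STTProvider { transcribe(audio: ArrayBuffer): Promise<string>; }
interface LLMProvider { complete(prompt: string): Promise<string>; }
interface TTSProvider { speak(text: string): Promise<ArrayBuffer>; }

// Wire the three slots into one pipeline function.
function compose(stt: STTProvider, llm: LLMProvider, tts: TTSProvider) {
  return async (audio: ArrayBuffer): Promise<ArrayBuffer> => {
    const transcript = await stt.transcribe(audio); // STT slot
    const reply = await llm.complete(transcript);   // LLM slot
    return tts.speak(reply);                        // TTS slot
  };
}
```

Because each slot is just an interface, any provider below can fill any slot of the matching kind.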

Speech-to-Text (STT)

| Provider      | Transport   | Models              | Interim Results | Preflight |
|---------------|-------------|---------------------|-----------------|-----------|
| NativeSTT     | Browser API | Browser default     | Yes             | No        |
| DeepgramSTT   | WebSocket   | V1: nova-3, nova-2  | Yes             | No        |
| DeepgramFlux  | WebSocket   | V2: flux-general-en | Yes             | Yes       |
| AssemblyAISTT | WebSocket   | Default model       | Yes             | No        |
| ElevenLabsSTT | WebSocket   | scribe_v2_realtime  | Yes             | No        |

NativeSTT

Uses the browser’s built-in Web Speech API. Zero API keys required. Best for prototyping and demos.

import { NativeSTT } from '@lukeocodes/composite-voice';

const stt = new NativeSTT({
  language: 'en-US',        // BCP 47 language tag
  continuous: true,          // keep listening after each result
  interimResults: true,      // emit partial transcripts
  maxAlternatives: 1,        // number of recognition alternatives
});
  • No API key needed
  • Works offline
  • Supports 50+ languages via the browser
  • Managed audio — the browser controls the microphone directly
  • Does not work in de-Googled browsers (Ungoogled Chromium, Brave) — the Web Speech API requires Google’s speech servers
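Because of that last limitation, it can be worth feature-detecting the Web Speech API before constructing NativeSTT and falling back to a cloud provider when it is absent. The helper name below is hypothetical, not part of the library; Chrome exposes the constructor under a `webkit` prefix, so both names are checked.

```typescript
// Hypothetical helper: returns the SpeechRecognition constructor if the
// current environment exposes one, otherwise null.
function getRecognitionCtor(): unknown | null {
  const g = globalThis as Record<string, unknown>;
  // Standard name first, then Chrome's prefixed variant.
  return (g['SpeechRecognition'] ?? g['webkitSpeechRecognition'] ?? null);
}
```

When this returns null (e.g. in Brave or a non-browser runtime), construct one of the WebSocket STT providers instead.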

API reference

DeepgramSTT

Production-grade real-time speech recognition via WebSocket using Deepgram’s V1 (Nova) API. Best accuracy across the widest range of languages.

import { DeepgramSTT } from '@lukeocodes/composite-voice';

const stt = new DeepgramSTT({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  language: 'en',
  interimResults: true,
  options: {
    model: 'nova-3',          // nova-3 (recommended), nova-2, nova-3-medical
    smartFormat: true,         // auto-punctuation and formatting
    punctuation: true,
    profanityFilter: false,
    diarize: false,            // speaker identification
    endpointing: 300,          // ms of silence before end-of-speech
    utteranceEndMs: 1000,      // ms before utterance boundary
  },
});
  • nova-3 (highest accuracy, recommended default), nova-2 (wider language support)
  • Word-level confidence and timestamps
  • Smart formatting and auto-punctuation
  • Profanity filtering
  • Speaker diarization
  • VAD events

Does not support preflight/eager end-of-turn signals. For the eager LLM pipeline, use DeepgramFlux.
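To make the `endpointing` option concrete: it is the number of milliseconds of silence after which Deepgram treats speech as finished. A minimal silence-tracking detector with the same semantics (our sketch, not Deepgram's implementation) looks like this:

```typescript
// Illustrative end-of-speech detector mirroring the `endpointing` option:
// report end-of-speech once `silenceMs` elapses with no speech frames.
class EndpointDetector {
  private lastSpeechAt = 0;
  constructor(private silenceMs: number) {}

  // Call once per audio frame with the frame time and a VAD verdict.
  update(nowMs: number, isSpeech: boolean): boolean {
    if (isSpeech) {
      this.lastSpeechAt = nowMs;
      return false;
    }
    return nowMs - this.lastSpeechAt >= this.silenceMs;
  }
}
```

With `endpointing: 300`, a 300 ms pause ends the segment; raise it for dictation-style speech with long pauses, lower it for snappy turn-taking.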

API reference

DeepgramFlux

Low-latency real-time speech recognition via WebSocket using Deepgram’s V2 (Flux) API. Supports eager end-of-turn signals for the eager LLM pipeline.

import { DeepgramFlux } from '@lukeocodes/composite-voice';

const stt = new DeepgramFlux({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  options: {
    model: 'flux-general-en',
    eagerEotThreshold: 0.5,    // enables eager end-of-turn signals
    eotThreshold: 0.7,
  },
});
  • Turn-based transcription via TurnInfo events
  • Eager end-of-turn signals (EagerEndOfTurn events with isPreflight: true)
  • Configurable end-of-turn confidence thresholds
  • Keyterm boosting for domain vocabulary
  • Only STT provider that supports the eager LLM pipeline
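The two thresholds from the config above interact as a two-stage gate: once end-of-turn confidence clears `eagerEotThreshold`, an eager (preflight) LLM request can start; once it clears `eotThreshold`, the turn is final. A sketch of that decision (threshold names taken from the config, logic assumed for illustration):

```typescript
type TurnSignal = 'none' | 'eager' | 'final';

// Classify an end-of-turn confidence score against the two Flux thresholds.
// Defaults match the example config: eager at 0.5, final at 0.7.
function classifyEot(confidence: number, eager = 0.5, final = 0.7): TurnSignal {
  if (confidence >= final) return 'final'; // commit the turn
  if (confidence >= eager) return 'eager'; // start a preflight LLM request
  return 'none';                           // keep listening
}
```

If confidence later drops (the user keeps talking), the eager request is cancelled; see the AbortSignal support noted under AnthropicLLM.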

API reference

AssemblyAISTT

Real-time speech recognition via WebSocket with word boosting for domain-specific vocabulary.

import { AssemblyAISTT } from '@lukeocodes/composite-voice';

const stt = new AssemblyAISTT({
  proxyUrl: '/api/proxy/assemblyai',
  // OR: apiKey: '...',
  sampleRate: 16000,
  language: 'en',
  wordBoost: ['CompositeVoice', 'WebSocket'],  // boost domain terms
});
  • Word boosting for domain vocabulary
  • Word-level timestamps and confidence
  • Automatic reconnection

API reference

ElevenLabsSTT

Real-time speech recognition via WebSocket using ElevenLabs Scribe V2 with ~150ms latency and 90+ language support.

import { ElevenLabsSTT } from '@lukeocodes/composite-voice';

const stt = new ElevenLabsSTT({
  proxyUrl: '/api/proxy/elevenlabs',
  // OR: apiKey: '...',
  // OR: token: '...',             // single-use token
  model: 'scribe_v2_realtime',
  audioFormat: 'pcm_16000',
  language: 'en',                  // BCP 47, ISO 639-1, or ISO 639-3
  commitStrategy: 'vad',           // 'vad' (default) or 'manual'
  includeTimestamps: true,         // word-level timestamps
});
  • VAD and manual commit strategies
  • 90+ languages with auto-detection
  • Word-level timestamps and confidence
  • Three auth methods (API key, proxy, single-use token)
  • Shares proxy config with ElevenLabsTTS

API reference


Large Language Models (LLM)

| Provider            | Base              | Default Model           | Streaming |
|---------------------|-------------------|-------------------------|-----------|
| AnthropicLLM        | Custom            | claude-haiku-4-5        | Yes       |
| OpenAILLM           | OpenAI-compatible | (required)              | Yes       |
| GroqLLM             | OpenAI-compatible | llama-3.3-70b-versatile | Yes       |
| MistralLLM          | OpenAI-compatible | mistral-small-latest    | Yes       |
| GeminiLLM           | OpenAI-compatible | gemini-2.0-flash        | Yes       |
| WebLLMLLM           | Custom            | (required)              | Yes       |
| OpenAICompatibleLLM | (base class)      | (required)              | Yes       |

AnthropicLLM

Claude models via the Anthropic API. Uses a dedicated SDK (not OpenAI-compatible).

import { AnthropicLLM } from '@lukeocodes/composite-voice';

const llm = new AnthropicLLM({
  proxyUrl: '/api/proxy/anthropic',
  model: 'claude-haiku-4-5',    // claude-haiku-4-5, claude-sonnet-4-5, claude-opus-4-5
  maxTokens: 1024,               // required (default: 1024)
});
  • System prompts at top level (Anthropic API convention)
  • Streaming via SSE
  • AbortSignal cancellation for the eager pipeline
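AbortSignal cancellation is what makes the eager pipeline safe: if a preflight LLM request turns out to be premature, aborting its controller rejects the pending work so a fresh request can replace it. A standalone sketch of the pattern (the helper name is ours, standing in for any in-flight request):

```typescript
// Stand-in for an in-flight LLM request: resolves after `ms`, or rejects
// immediately if the signal aborts first.
function cancellableDelay(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('done'), ms);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(new Error('aborted'));
    });
  });
}
```

In the eager pipeline, each new transcript creates a fresh AbortController and aborts the previous one before issuing a new request.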

API reference

OpenAILLM

GPT models via the OpenAI API.

import { OpenAILLM } from '@lukeocodes/composite-voice';

const llm = new OpenAILLM({
  proxyUrl: '/api/proxy/openai',
  model: 'gpt-4o-mini',
  // organizationId: 'org-...',  // for multi-org accounts
});

API reference

GroqLLM

Ultra-fast inference on Groq’s LPU hardware. Supports open-source models.

import { GroqLLM } from '@lukeocodes/composite-voice';

const llm = new GroqLLM({
  proxyUrl: '/api/proxy/groq',
  model: 'llama-3.3-70b-versatile',  // or mixtral-8x7b-32768, gemma2-9b-it
});
  • Lowest latency of any cloud LLM provider
  • Wide range of open-source models

API reference

MistralLLM

Mistral models with strong multilingual support.

import { MistralLLM } from '@lukeocodes/composite-voice';

const llm = new MistralLLM({
  proxyUrl: '/api/proxy/mistral',
  model: 'mistral-small-latest',  // or mistral-medium-latest, mistral-large-latest
});

API reference

GeminiLLM

Google Gemini models via their OpenAI-compatible endpoint.

import { GeminiLLM } from '@lukeocodes/composite-voice';

const llm = new GeminiLLM({
  proxyUrl: '/api/proxy/gemini',
  model: 'gemini-2.0-flash',  // or gemini-1.5-pro, gemini-1.5-flash
});

API reference

WebLLMLLM

Run LLMs entirely in the browser via WebGPU. No API keys, no network, full privacy.

import { WebLLMLLM } from '@lukeocodes/composite-voice';

const llm = new WebLLMLLM({
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  onLoadProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(0)}%`);
  },
});
  • All data stays in the browser
  • Works offline after initial model download
  • Requires a WebGPU-capable browser
  • First load downloads model weights (100+ MB)
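Since WebGPU support is the hard requirement here, it is sensible to check for it before paying the model-download cost. A minimal detection sketch (the library may perform its own check as well):

```typescript
// Returns true when the runtime exposes navigator.gpu (the WebGPU entry
// point); false in non-WebGPU browsers and in server-side runtimes.
function hasWebGPU(): boolean {
  const nav = (globalThis as { navigator?: { gpu?: unknown } }).navigator;
  return nav?.gpu !== undefined;
}
```

When this returns false, fall back to one of the cloud LLM providers above.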

API reference

OpenAICompatibleLLM

Base class for any service that speaks the OpenAI chat completions format. Use this to connect custom or self-hosted models.

import { OpenAICompatibleLLM } from '@lukeocodes/composite-voice';

const llm = new OpenAICompatibleLLM({
  baseURL: 'https://my-model-server.example.com/v1',
  model: 'my-custom-model',
  apiKey: '...',
});
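"Speaks the OpenAI chat completions format" means, for streaming, emitting server-sent events whose `data:` payloads carry incremental deltas. A minimal parser for one such line (a sketch of the wire format; the library handles this internally):

```typescript
// Extract the text delta from one OpenAI-style SSE line, e.g.
//   data: {"choices":[{"delta":{"content":"Hi"}}]}
// Returns null for non-data lines and for the terminal `data: [DONE]`.
function extractDelta(sseLine: string): string | null {
  if (!sseLine.startsWith('data: ')) return null;
  const payload = sseLine.slice('data: '.length).trim();
  if (payload === '[DONE]') return null;
  const parsed = JSON.parse(payload) as {
    choices?: { delta?: { content?: string } }[];
  };
  return parsed.choices?.[0]?.delta?.content ?? null;
}
```

Any self-hosted server whose stream parses this way (vLLM, Ollama, llama.cpp's server, etc.) should work with this base class.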

API reference


Text-to-Speech (TTS)

| Provider      | Transport   | Voices            | Streaming    | Audio Format                    |
|---------------|-------------|-------------------|--------------|---------------------------------|
| NativeTTS     | Browser API | System voices     | No (managed) | N/A                             |
| DeepgramTTS   | WebSocket   | Aura 2 (7 voices) | Yes          | linear16, mulaw, alaw           |
| OpenAITTS     | REST        | 6 voices          | No           | mp3, opus, aac, flac, wav       |
| ElevenLabsTTS | WebSocket   | Custom voice IDs  | Yes          | pcm, mp3, ulaw                  |
| CartesiaTTS   | WebSocket   | Custom voice IDs  | Yes          | pcm (s16le, f32le, mulaw, alaw) |

NativeTTS

Uses the browser’s built-in SpeechSynthesis API. Zero API keys required.

import { NativeTTS } from '@lukeocodes/composite-voice';

const tts = new NativeTTS({
  voiceName: 'Samantha',    // partial match against available voices
  voiceLang: 'en-US',       // BCP 47 fallback filter
  rate: 1.0,                // speech rate
  pitch: 0,                 // semitones (-20 to 20)
});
  • No API key needed
  • Works offline
  • Managed audio — the browser plays directly
  • Supports pause, resume, and cancel
  • Voice enumeration via getAvailableVoices()

API reference

DeepgramTTS

Low-latency real-time streaming TTS via WebSocket with Aura 2 voices.

import { DeepgramTTS } from '@lukeocodes/composite-voice';

const tts = new DeepgramTTS({
  proxyUrl: '/api/proxy/deepgram',
  voice: 'aura-2-thalia-en',    // thalia, andromeda, janus, proteus, orion, luna, arcas
  sampleRate: 24000,
  outputFormat: 'linear16',
});
  • Lowest latency streaming TTS
  • Word-level timing metadata
  • Aura 2 voice models
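For playback, `outputFormat: 'linear16'` means raw 16-bit signed PCM, while the Web Audio API works in floats from -1 to 1. A conversion sketch for feeding received chunks into an AudioBuffer (our helper, not part of the library):

```typescript
// Convert 16-bit signed PCM samples (linear16) to Web Audio float samples.
// Negative values divide by 32768, positive by 32767, so both endpoints
// map exactly to -1 and 1.
function linear16ToFloat(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / (pcm[i] < 0 ? 0x8000 : 0x7fff);
  }
  return out;
}
```

The `sampleRate` option (24000 above) must match the rate you give the AudioContext or playback will be pitch-shifted.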

API reference

OpenAITTS

OpenAI text-to-speech via REST. Returns complete audio in one request.

import { OpenAITTS } from '@lukeocodes/composite-voice';

const tts = new OpenAITTS({
  proxyUrl: '/api/proxy/openai',
  model: 'tts-1',          // tts-1 (fast) or tts-1-hd (quality)
  voice: 'nova',           // alloy, echo, fable, onyx, nova, shimmer
  responseFormat: 'mp3',   // mp3, opus, aac, flac, wav
  speed: 1.0,              // 0.25 to 4.0
});
  • Six distinct voices
  • Quality/speed tradeoff via model selection
  • 4096 character limit per request
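Because of the 4096-character limit, long LLM responses need to be split across requests. One simple approach is to break at sentence boundaries so each request still reads naturally; the splitting strategy below is our sketch, not something the library does for you.

```typescript
// Split text into chunks no longer than `limit` characters, breaking at
// sentence boundaries where possible (regex treats ., !, ? as terminators).
function chunkText(text: string, limit = 4096): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = '';
  for (const s of sentences) {
    if (current.length + s.length > limit && current) {
      chunks.push(current.trim());
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then becomes one REST request; a single sentence longer than the limit would still need further splitting.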

API reference

ElevenLabsTTS

High-quality voice cloning and synthesis via WebSocket streaming.

import { ElevenLabsTTS } from '@lukeocodes/composite-voice';

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: 'your-voice-id',           // from ElevenLabs dashboard
  modelId: 'eleven_turbo_v2_5',       // turbo_v2_5, turbo_v2, multilingual_v2
  stability: 0.5,                      // voice consistency (0-1)
  similarityBoost: 0.75,              // voice fidelity (0-1)
  outputFormat: 'pcm_16000',          // pcm_16000, pcm_22050, pcm_24000, mp3_44100_128
});
  • Voice cloning
  • Multilingual models
  • Stability and similarity controls
  • Multiple output formats

API reference

CartesiaTTS

Ultra-low-latency streaming TTS with emotion controls.

import { CartesiaTTS } from '@lukeocodes/composite-voice';

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'your-voice-id',
  modelId: 'sonic-2',           // sonic-2 (latest), sonic, sonic-multilingual
  language: 'en',
  outputEncoding: 'pcm_s16le',
  outputSampleRate: 16000,
  speed: 'normal',              // or 'slow', 'fast'
  emotion: ['positivity:high'], // emotion tags
});
  • Context-based streaming links chunks into coherent utterances
  • Emotion controls
  • Word-level timestamps
  • sonic-2 model delivers the lowest latency

API reference


Choosing providers

For prototyping: NativeSTT + any LLM + NativeTTS — no API keys except the LLM.

For production: DeepgramSTT + AnthropicLLM + DeepgramTTS — best accuracy, lowest latency, streaming throughout.

For privacy: NativeSTT + WebLLMLLM + NativeTTS — everything runs in the browser. No data leaves the device.

For lowest latency: DeepgramFlux + GroqLLM + DeepgramTTS — eager end-of-turn signals, fastest LLM inference, low-latency streaming TTS.

© 2026 CompositeVoice. All rights reserved.
