ElevenLabsTTS

Stream high-quality voice synthesis with ElevenLabs voice cloning, stability controls, and multilingual models.

Use ElevenLabsTTS for high-fidelity voice cloning and expressive synthesis. Text streams over a WebSocket connection using the ElevenLabs stream-input protocol, and audio chunks arrive incrementally for low-latency playback.

Prerequisites

Basic setup

import { CompositeVoice, DeepgramSTT, AnthropicLLM, ElevenLabsTTS } from '@lukeocodes/composite-voice';

const voice = new CompositeVoice({
  stt: new DeepgramSTT({ proxyUrl: '/api/proxy/deepgram' }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
  }),
  tts: new ElevenLabsTTS({
    proxyUrl: '/api/proxy/elevenlabs',
    voiceId: '21m00Tcm4TlvDq8ikWAM',
    modelId: 'eleven_turbo_v2_5',
    outputFormat: 'pcm_24000',
  }),
});

await voice.start();

Configuration options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| apiKey | string | | ElevenLabs API key (direct mode) |
| proxyUrl | string | | Proxy server URL (recommended for production) |
| voiceId | string | (required) | ElevenLabs voice ID |
| modelId | string | 'eleven_turbo_v2_5' | Synthesis model |
| stability | number | 0.5 | Voice consistency (0 to 1) |
| similarityBoost | number | 0.75 | Closeness to the original voice sample (0 to 1) |
| outputFormat | string | 'pcm_16000' | Audio format string (see below) |

Models

| Model | Description |
| --- | --- |
| eleven_turbo_v2_5 | Latest turbo model, optimized for low latency (default) |
| eleven_turbo_v2 | Previous-generation turbo model |
| eleven_multilingual_v2 | High-quality multilingual synthesis |
| eleven_monolingual_v1 | English-only legacy model |

Output formats

| Format | Encoding | Sample Rate |
| --- | --- | --- |
| pcm_16000 | 16-bit PCM | 16 kHz |
| pcm_22050 | 16-bit PCM | 22.05 kHz |
| pcm_24000 | 16-bit PCM | 24 kHz |
| pcm_44100 | 16-bit PCM | 44.1 kHz |
| mp3_44100_128 | MP3 128 kbps | 44.1 kHz |
| ulaw_8000 | mu-law | 8 kHz (telephony) |
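
The format strings above encode the encoding and sample rate in their name, which is handy when you need to configure a player for the incoming audio. The following is a minimal sketch, not part of the SDK: `parseOutputFormat` and `pcm16ToFloat32` are hypothetical helper names, and the conversion assumes the PCM formats are mono, signed, little-endian 16-bit samples delivered as raw bytes.

```typescript
// Parsed form of an output format string such as 'pcm_24000' or 'mp3_44100_128'.
interface ParsedFormat {
  encoding: string;      // 'pcm', 'mp3', or 'ulaw'
  sampleRate: number;    // in Hz
  bitrateKbps?: number;  // present only when the string has a third field
}

// Hypothetical helper: split '<encoding>_<sampleRate>[_<bitrate>]' into its parts.
function parseOutputFormat(format: string): ParsedFormat {
  const [encoding, rate, bitrate] = format.split('_');
  const sampleRate = Number(rate);
  if (!encoding || !Number.isFinite(sampleRate) || sampleRate <= 0) {
    throw new Error(`Unrecognized output format: ${format}`);
  }
  return bitrate === undefined
    ? { encoding, sampleRate }
    : { encoding, sampleRate, bitrateKbps: Number(bitrate) };
}

// Hypothetical helper: convert 16-bit PCM bytes into floats in [-1, 1),
// the sample representation the Web Audio API works with.
// Assumes little-endian, signed, mono samples; a trailing odd byte is ignored.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

With these in place, `parseOutputFormat('pcm_24000').sampleRate` gives the rate to use when creating an audio buffer, and `pcm16ToFloat32` prepares each PCM chunk for it. MP3 and mu-law chunks need a decoder instead.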

Complete example

import { CompositeVoice, DeepgramSTT, AnthropicLLM, ElevenLabsTTS } from '@lukeocodes/composite-voice';

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: '21m00Tcm4TlvDq8ikWAM',
  modelId: 'eleven_multilingual_v2',
  stability: 0.7,
  similarityBoost: 0.9,
  outputFormat: 'pcm_24000',
});

const voice = new CompositeVoice({
  stt: new DeepgramSTT({ proxyUrl: '/api/proxy/deepgram' }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
  }),
  tts,
  logging: { enabled: true, level: 'debug' },
});

voice.on('tts:start', () => console.log('Speaking...'));
voice.on('tts:end', () => console.log('Done speaking'));

await voice.start();

Voice cloning controls

ElevenLabs provides two parameters that control how closely the synthesized voice matches the original sample:

  • Stability (0 to 1) — Higher values produce more consistent, predictable output. Lower values add variation and expressiveness. Start at 0.5 and adjust to taste.
  • Similarity boost (0 to 1) — Higher values match the original voice more closely. Lower values allow more creative variation. Start at 0.75.

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: 'your-cloned-voice-id',
  stability: 0.3,          // More expressive
  similarityBoost: 0.9,    // Stay close to the original voice
});
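
Both parameters are documented as ranging from 0 to 1, so it can help to normalize values before passing them to the constructor, for example when they come from a settings UI. The sketch below is illustrative only: `clamp01`, `voiceSettings`, and `presets` are hypothetical names, and the preset numbers are starting points chosen for this example rather than values recommended by ElevenLabs.

```typescript
// Shape matching the two ElevenLabsTTS options discussed above.
interface VoiceSettings {
  stability: number;
  similarityBoost: number;
}

// Hypothetical helper: force a value into the documented 0..1 range.
function clamp01(value: number): number {
  if (Number.isNaN(value)) {
    throw new RangeError('Voice setting must be a number');
  }
  return Math.min(1, Math.max(0, value));
}

// Hypothetical helper: build settings, falling back to the documented defaults.
function voiceSettings(stability = 0.5, similarityBoost = 0.75): VoiceSettings {
  return {
    stability: clamp01(stability),
    similarityBoost: clamp01(similarityBoost),
  };
}

// Illustrative presets (assumed values, tune by ear).
const presets = {
  conversational: voiceSettings(0.3, 0.9), // lower stability, more expressive
  narration: voiceSettings(0.8, 0.75),     // higher stability, more consistent
};
```

A preset can then be spread into the options object, for example `new ElevenLabsTTS({ proxyUrl, voiceId, ...presets.conversational })`.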

Streaming protocol

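ElevenLabsTTS uses the ElevenLabs stream-input protocol, which brackets each utterance with a BOS (beginning of stream) message and an EOS (end of stream) message. Inside a CompositeVoice pipeline the SDK handles this automatically. To use the provider on its own, call the lifecycle methods yourself: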

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: '21m00Tcm4TlvDq8ikWAM',
});

await tts.initialize();
await tts.connect();       // Opens WebSocket, sends BOS with voice settings

tts.onAudio((chunk) => {
  // chunk.data contains audio encoded in the configured outputFormat (PCM, MP3, or mu-law)
});

tts.sendText('Hello, ');
tts.sendText('world!');
await tts.finalize();      // Sends EOS, flushes remaining audio
await tts.disconnect();
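
For orientation, the sequence above corresponds roughly to the following JSON frames on the WebSocket. This is a sketch based on the public ElevenLabs stream-input documentation, where BOS is a message whose text is a single space and EOS is a message whose text is empty. It is not taken from the SDK source, so the SDK's actual frames may differ, and the function names here are hypothetical.

```typescript
// BOS: a single-space text plus the voice settings for the stream.
function bosMessage(stability: number, similarityBoost: number): string {
  return JSON.stringify({
    text: ' ',
    voice_settings: { stability, similarity_boost: similarityBoost },
  });
}

// A regular text frame carrying the next piece of text to synthesize.
function textMessage(text: string): string {
  return JSON.stringify({ text });
}

// EOS: an empty text tells the server to flush remaining audio and finish.
function eosMessage(): string {
  return JSON.stringify({ text: '' });
}

// Build the full frame sequence for one utterance.
// Empty chunks are dropped because an empty text would be read as EOS.
function buildStream(
  chunks: string[],
  stability = 0.5,
  similarityBoost = 0.75,
): string[] {
  return [
    bosMessage(stability, similarityBoost),
    ...chunks.filter((chunk) => chunk.length > 0).map(textMessage),
    eosMessage(),
  ];
}
```

Calling `buildStream(['Hello, ', 'world!'])` yields four frames: BOS, two text frames, and EOS, mirroring the `connect`, `sendText`, `sendText`, `finalize` calls shown above.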

Tips

  • Find voice IDs in the ElevenLabs Voice Library or via the list voices API endpoint.
  • Use eleven_turbo_v2_5 for the lowest latency. Use eleven_multilingual_v2 when you need non-English languages.
  • For telephony applications, use the ulaw_8000 output format.
  • Lower stability values work well for conversational, expressive speech. Higher values suit narration and formal content.

Further reading
