
ElevenLabsSTT

Add real-time speech recognition with ElevenLabs Scribe V2 to your voice pipeline via WebSocket streaming.

Use ElevenLabsSTT when you need low-latency real-time transcription (~150ms) with 90+ language support, voice activity detection, and optional word-level timestamps via ElevenLabs’ Scribe V2 Realtime API.

Prerequisites

  • An ElevenLabs API key
  • No additional peer dependencies required

ElevenLabsSTT connects through a raw WebSocket managed by the SDK’s built-in WebSocketManager.

For production, set up a proxy server so your API key stays server-side. The proxy uses the same elevenlabsApiKey config used for ElevenLabsTTS.
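
As noted in the tips below, the SDK rewrites an http(s) proxyUrl to its ws(s) equivalent before opening the WebSocket. Conceptually that rewrite is just a scheme swap; toWsUrl below is an illustrative helper, not an SDK export:

```typescript
// Illustrative sketch of the http(s) -> ws(s) scheme swap the SDK
// performs on proxyUrl. toWsUrl is a hypothetical name, not SDK API.
function toWsUrl(url: string): string {
  // 'https:' becomes 'wss:', 'http:' becomes 'ws:'
  return url.replace(/^http(s?):/, 'ws$1:');
}
```

Relative proxy URLs like '/api/proxy/elevenlabs' are presumably resolved against the page origin first; the sketch only covers absolute URLs.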

Basic setup

import { CompositeVoice, ElevenLabsSTT, AnthropicLLM, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new ElevenLabsSTT({
    proxyUrl: '/api/proxy/elevenlabs',
    model: 'scribe_v2_realtime',
    audioFormat: 'pcm_16000',
  }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
    systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
  }),
  tts: new NativeTTS(),
});

await agent.start();

Configuration options

| Option | Type | Default | Description |
|---|---|---|---|
| proxyUrl | string | | URL of your CompositeVoice proxy endpoint (recommended) |
| apiKey | string | | ElevenLabs API key (development only) |
| token | string | | Single-use token for short-lived browser sessions |
| model | string | 'scribe_v2_realtime' | Transcription model |
| audioFormat | string | 'pcm_16000' | Audio encoding format (see table below) |
| language | string | | Language code (BCP 47 or ISO 639-3). Omit for auto-detection |
| commitStrategy | 'vad' or 'manual' | 'vad' | How utterance boundaries are determined |
| vadSilenceThresholdSecs | number | 1.5 | Seconds of silence before VAD commits |
| vadThreshold | number | 0.4 | VAD sensitivity (0.0 to 1.0) |
| minSpeechDurationMs | number | 100 | Minimum speech duration before detection |
| minSilenceDurationMs | number | 100 | Minimum silence duration before a segment ends |
| includeTimestamps | boolean | false | Include word-level timestamps in results |
| includeLanguageDetection | boolean | false | Include detected language in results |
| previousText | string | | Context text for the model (first chunk only, max ~50 chars) |
| enableLogging | boolean | true | When false, enables zero-retention mode |
| interimResults | boolean | true | Emit partial transcripts while the user speaks |
| timeout | number | 10000 | Connection timeout in milliseconds |

See the API reference for the full list.

Audio formats

| Format | Encoding | Sample Rate |
|---|---|---|
| pcm_16000 | 16-bit PCM | 16 kHz (recommended) |
| pcm_22050 | 16-bit PCM | 22.05 kHz |
| pcm_24000 | 16-bit PCM | 24 kHz |
| pcm_44100 | 16-bit PCM | 44.1 kHz |
| mulaw_8000 | mu-law | 8 kHz (telephony) |
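
All pcm_* formats carry 16-bit signed samples, while Web Audio exposes microphone input as Float32 values in [-1, 1]. If you ever need to produce these samples yourself, the conversion looks roughly like this (a sketch; float32ToPcm16 is an illustrative name, not part of the SDK):

```typescript
// Convert Float32 samples in [-1, 1] (Web Audio's native format) to
// 16-bit signed PCM, the sample encoding behind the pcm_* formats above.
function float32ToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to int16 range
  }
  return out;
}
```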

Commit strategies

ElevenLabsSTT supports two strategies for determining when a speech segment is finalized:

  • VAD (default) — The server automatically detects speech pauses and commits. Best for microphone input. Tune the pause length with vadSilenceThresholdSecs and the detection sensitivity with vadThreshold.
  • Manual — Your application controls when to commit by calling sendCommit(). Useful when you need precise control over segment boundaries.
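
For manual mode, a hedged sketch of the configuration and commit call (assuming, per the description above, that sendCommit() is exposed on the ElevenLabsSTT instance):

```typescript
// Manual commit mode: the application decides when an utterance ends.
// Sketch only; the option shape matches the configuration table above.
const sttOptions = {
  proxyUrl: '/api/proxy/elevenlabs',
  commitStrategy: 'manual' as const, // disable server-side VAD commits
};

// Later, e.g. when the user releases a push-to-talk button:
// stt.sendCommit(); // finalize the buffered audio into one segment
```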

Authentication methods

Three methods are supported, with this priority order: proxyUrl > token > apiKey.

  • Proxy (recommended) — API key stays server-side. Uses the same elevenlabsApiKey config as ElevenLabsTTS.
  • Single-use token — Generate a short-lived token server-side via the ElevenLabs API, pass it to the browser. Expires after 15 minutes.
  • API key — Direct connection for development only. Never expose in production.
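
The precedence can be pictured as a simple resolution step. This sketch mirrors the documented order; resolveAuth is an illustrative helper, not an SDK export:

```typescript
type AuthConfig = { proxyUrl?: string; token?: string; apiKey?: string };

// Mirrors the documented precedence: proxyUrl > token > apiKey.
function resolveAuth(cfg: AuthConfig): 'proxy' | 'token' | 'apiKey' {
  if (cfg.proxyUrl) return 'proxy';   // highest priority, key stays server-side
  if (cfg.token) return 'token';      // short-lived browser session
  if (cfg.apiKey) return 'apiKey';    // development only
  throw new Error('ElevenLabsSTT requires proxyUrl, token, or apiKey');
}
```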

Language codes

ElevenLabs Scribe V2 uses ISO 639-3 (3-letter) codes internally. ElevenLabsSTT accepts all common formats and converts automatically:

  • BCP 47: en-US, fr-FR, pt-BR — extracted and mapped
  • ISO 639-1: en, fr, de — mapped to ISO 639-3
  • ISO 639-3: eng, fra, deu — passed through directly

Omit the language option entirely to enable auto-detection (90+ languages).
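
The normalization described above can be sketched as follows (the mapping here is an illustrative subset, and toIso6393 is a hypothetical name for the SDK's internal conversion):

```typescript
// Illustrative subset of the ISO 639-1 -> 639-3 mapping; the SDK ships
// a complete table. toIso6393 is a hypothetical name, not SDK API.
const ISO_639_1_TO_3: Record<string, string> = {
  en: 'eng', fr: 'fra', de: 'deu', pt: 'por', es: 'spa',
};

function toIso6393(code: string): string {
  const primary = code.split('-')[0].toLowerCase(); // 'en-US' -> 'en'
  if (primary.length === 3) return primary;         // already ISO 639-3
  return ISO_639_1_TO_3[primary] ?? primary;        // map 2-letter codes
}
```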

Complete example

import { CompositeVoice, ElevenLabsSTT, AnthropicLLM, ElevenLabsTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new ElevenLabsSTT({
    proxyUrl: '/api/proxy/elevenlabs',
    model: 'scribe_v2_realtime',
    audioFormat: 'pcm_16000',
    language: 'en',
    commitStrategy: 'vad',
    includeTimestamps: true,
    vadSilenceThresholdSecs: 1.5,
  }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
    maxTokens: 256,
    systemPrompt: 'You are a helpful voice assistant. Keep responses under two sentences.',
  }),
  tts: new ElevenLabsTTS({
    proxyUrl: '/api/proxy/elevenlabs',
    voiceId: '21m00Tcm4TlvDq8ikWAM',
    modelId: 'eleven_turbo_v2_5',
  }),
  conversationHistory: { enabled: true, maxTurns: 10 },
  logging: { enabled: true, level: 'info' },
});

agent.on('transcription:final', (event) => {
  console.log('User said:', event.text);
});

agent.on('response:text', (event) => {
  console.log('Assistant:', event.text);
});

await agent.start();

Tips and gotchas

  • Always use a proxy in production. Pass proxyUrl instead of apiKey so your ElevenLabs key never reaches the browser. The SDK converts http(s) to ws(s) automatically. The proxy shares the same elevenlabsApiKey with ElevenLabsTTS.
  • No peer dependencies. Like AssemblyAISTT, ElevenLabsSTT uses the SDK’s built-in WebSocketManager — no extra packages to install.
  • Audio is base64-encoded. The provider converts raw ArrayBuffer audio into base64 JSON messages ({ audio_base_64: "..." }) before sending. This is handled automatically.
  • VAD mode is recommended for microphone input. It automatically detects speech pauses and commits segments. Use manual mode only when you need explicit control.
  • previousText is sent once. The context string is included only on the first audio chunk to help the model. It is not repeated on subsequent chunks.
  • No preflight signals. ElevenLabsSTT does not emit preflight/eager end-of-turn events. If you need the eager LLM pipeline, use DeepgramFlux instead.
  • Session handshake. The connect() call waits for a session_started message from the server before resolving, ensuring the session is fully initialized before audio streaming begins.
  • Auto-commit fallback. Even in manual mode, the server auto-commits after 90 seconds of buffered audio.
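
The base64 framing mentioned in the tips can be made concrete with a few lines of Node-flavored TypeScript (frameAudioChunk is an illustrative name; the SDK performs this step for you):

```typescript
// Sketch of the wire framing described above: raw audio bytes become a
// base64 string inside a JSON message. Handled automatically by the SDK;
// shown here only to make the message format concrete.
function frameAudioChunk(chunk: ArrayBuffer): string {
  const audio_base_64 = Buffer.from(chunk).toString('base64'); // Node Buffer API
  return JSON.stringify({ audio_base_64 });
}
```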

© 2026 CompositeVoice. All rights reserved.
