# ElevenLabsSTT
Add real-time speech recognition with ElevenLabs Scribe V2 to your voice pipeline via WebSocket streaming.
Use ElevenLabsSTT when you need low-latency real-time transcription (~150ms) with 90+ language support, voice activity detection, and optional word-level timestamps via ElevenLabs’ Scribe V2 Realtime API.
## Prerequisites
- An ElevenLabs API key
- No additional peer dependencies required
ElevenLabsSTT connects through a raw WebSocket managed by the SDK’s built-in `WebSocketManager`.
For production, set up a proxy server so your API key stays server-side. The proxy uses the same `elevenlabsApiKey` config used for ElevenLabsTTS.
## Basic setup
```typescript
import { CompositeVoice, ElevenLabsSTT, AnthropicLLM, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new ElevenLabsSTT({
    proxyUrl: '/api/proxy/elevenlabs',
    model: 'scribe_v2_realtime',
    audioFormat: 'pcm_16000',
  }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
    systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
  }),
  tts: new NativeTTS(),
});

await agent.start();
```
## Configuration options
| Option | Type | Default | Description |
|---|---|---|---|
| `proxyUrl` | `string` | — | URL of your CompositeVoice proxy endpoint (recommended) |
| `apiKey` | `string` | — | ElevenLabs API key (development only) |
| `token` | `string` | — | Single-use token for short-lived browser sessions |
| `model` | `string` | `'scribe_v2_realtime'` | Transcription model |
| `audioFormat` | `string` | `'pcm_16000'` | Audio encoding format (see table below) |
| `language` | `string` | — | Language code (BCP 47 or ISO 639-3). Omit for auto-detection |
| `commitStrategy` | `'vad' \| 'manual'` | `'vad'` | How utterance boundaries are determined |
| `vadSilenceThresholdSecs` | `number` | `1.5` | Seconds of silence before VAD commits |
| `vadThreshold` | `number` | `0.4` | VAD sensitivity (0.0 to 1.0) |
| `minSpeechDurationMs` | `number` | `100` | Minimum speech duration before detection |
| `minSilenceDurationMs` | `number` | `100` | Minimum silence duration before a segment ends |
| `includeTimestamps` | `boolean` | `false` | Include word-level timestamps in results |
| `includeLanguageDetection` | `boolean` | `false` | Include detected language in results |
| `previousText` | `string` | — | Context text for the model (first chunk only, max ~50 chars) |
| `enableLogging` | `boolean` | `true` | When `false`, enables zero-retention mode |
| `interimResults` | `boolean` | `true` | Emit partial transcripts while the user speaks |
| `timeout` | `number` | `10000` | Connection timeout in milliseconds |
See the API reference for the full list.
## Audio formats
| Format | Encoding | Sample Rate |
|---|---|---|
| `pcm_16000` | 16-bit PCM | 16 kHz (recommended) |
| `pcm_22050` | 16-bit PCM | 22.05 kHz |
| `pcm_24000` | 16-bit PCM | 24 kHz |
| `pcm_44100` | 16-bit PCM | 44.1 kHz |
| `mulaw_8000` | mu-law | 8 kHz (telephony) |
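The `pcm_*` formats expect 16-bit little-endian samples. Browser audio APIs typically hand you `Float32Array` samples in the range [-1, 1], so a conversion step is needed before streaming. A minimal sketch of that conversion (this helper is not part of the SDK; it is shown only for illustration):

```typescript
// Convert Float32 samples in [-1, 1] (as produced by the Web Audio API)
// into 16-bit little-endian PCM, the layout the pcm_* formats expect.
function floatTo16BitPCM(input: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buffer;
}

const pcm = floatTo16BitPCM(new Float32Array([0, 0.5, -0.5, 1]));
console.log(new Int16Array(pcm)); // Int16Array [ 0, 16383, -16384, 32767 ]
```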
## Commit strategies
ElevenLabsSTT supports two strategies for determining when a speech segment is finalized:
- **VAD** (default) — The server automatically detects speech pauses and commits. Best for microphone input. Configure sensitivity with `vadSilenceThresholdSecs` and `vadThreshold`.
- **Manual** — Your application controls when to commit by calling `sendCommit()`. Useful when you need precise control over segment boundaries.
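To build intuition for how the two VAD parameters interact, here is a deliberately simplified model: each audio frame gets a speech probability, frames below `vadThreshold` count as silence, and a commit fires once silence has persisted for `vadSilenceThresholdSecs`. This is an illustrative sketch, not the actual server-side algorithm inside Scribe V2 Realtime:

```typescript
// Simplified VAD commit model (illustrative only -- the real VAD runs
// server-side). Returns the time at which a commit would fire, or null.
function findCommitTime(
  speechProbs: number[],      // one speech probability per frame
  frameSecs: number,          // duration of each frame in seconds
  threshold = 0.4,            // cf. vadThreshold
  silenceThresholdSecs = 1.5, // cf. vadSilenceThresholdSecs
): number | null {
  let silentSecs = 0;
  for (let i = 0; i < speechProbs.length; i++) {
    // Reset the silence counter whenever speech is detected.
    silentSecs = speechProbs[i] < threshold ? silentSecs + frameSecs : 0;
    if (silentSecs >= silenceThresholdSecs) return (i + 1) * frameSecs;
  }
  return null; // no commit yet -- keep buffering audio
}

// 1 s of speech followed by silence: commits 1.5 s into the silence.
const probs = [...Array(10).fill(0.9), ...Array(30).fill(0.05)];
console.log(findCommitTime(probs, 0.1)); // 2.5
```

Raising `vadThreshold` makes the detector stricter about what counts as speech; raising `vadSilenceThresholdSecs` tolerates longer mid-utterance pauses before committing.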
## Authentication methods
Three methods are supported, with this priority order: `proxyUrl` > `token` > `apiKey`.
- **Proxy (recommended)** — API key stays server-side. Uses the same `elevenlabsApiKey` config as ElevenLabsTTS.
- **Single-use token** — Generate a short-lived token server-side via the ElevenLabs API and pass it to the browser. Expires after 15 minutes.
- **API key** — Direct connection for development only. Never expose in production.
## Language codes
ElevenLabs Scribe V2 uses ISO 639-3 (3-letter) codes internally. ElevenLabsSTT accepts all common formats and converts automatically:

- BCP 47: `en-US`, `fr-FR`, `pt-BR` — extracted and mapped
- ISO 639-1: `en`, `fr`, `de` — mapped to ISO 639-3
- ISO 639-3: `eng`, `fra`, `deu` — passed through directly

Omit the `language` option entirely to enable auto-detection (90+ languages).
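The conversion happens inside the SDK, but the idea is easy to picture: take the primary subtag, then map 2-letter codes to their 3-letter equivalents. A sketch covering only a handful of languages (the SDK's internal table is far more complete, and this helper name is hypothetical):

```typescript
// Illustrative language-code normalization to ISO 639-3. Only a few
// languages are mapped here; the SDK's internal mapping is more complete.
const ISO1_TO_ISO3: Record<string, string> = {
  en: 'eng', fr: 'fra', de: 'deu', pt: 'por', es: 'spa',
};

function toIso6393(code: string): string {
  const primary = code.split('-')[0].toLowerCase(); // 'en-US' -> 'en'
  if (primary.length === 3) return primary;         // already ISO 639-3
  return ISO1_TO_ISO3[primary] ?? primary;          // map 2-letter codes
}

console.log(toIso6393('en-US')); // 'eng'
console.log(toIso6393('fra'));   // 'fra'
```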
## Complete example
```typescript
import { CompositeVoice, ElevenLabsSTT, AnthropicLLM, ElevenLabsTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new ElevenLabsSTT({
    proxyUrl: '/api/proxy/elevenlabs',
    model: 'scribe_v2_realtime',
    audioFormat: 'pcm_16000',
    language: 'en',
    commitStrategy: 'vad',
    includeTimestamps: true,
    vadSilenceThresholdSecs: 1.5,
  }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
    maxTokens: 256,
    systemPrompt: 'You are a helpful voice assistant. Keep responses under two sentences.',
  }),
  tts: new ElevenLabsTTS({
    proxyUrl: '/api/proxy/elevenlabs',
    voiceId: '21m00Tcm4TlvDq8ikWAM',
    modelId: 'eleven_turbo_v2_5',
  }),
  conversationHistory: { enabled: true, maxTurns: 10 },
  logging: { enabled: true, level: 'info' },
});

agent.on('transcription:final', (event) => {
  console.log('User said:', event.text);
});

agent.on('response:text', (event) => {
  console.log('Assistant:', event.text);
});

await agent.start();
```
## Tips and gotchas
- **Always use a proxy in production.** Pass `proxyUrl` instead of `apiKey` so your ElevenLabs key never reaches the browser. The SDK converts `http(s)` to `ws(s)` automatically. The proxy shares the same `elevenlabsApiKey` with ElevenLabsTTS.
- **No peer dependencies.** Like AssemblyAISTT, ElevenLabsSTT uses the SDK’s built-in `WebSocketManager` — no extra packages to install.
- **Audio is base64-encoded.** The provider converts raw `ArrayBuffer` audio into base64 JSON messages (`{ audio_base_64: "..." }`) before sending. This is handled automatically.
- **VAD mode is recommended for microphone input.** It automatically detects speech pauses and commits segments. Use manual mode only when you need explicit control.
- **`previousText` is sent once.** The context string is included only on the first audio chunk to help the model. It is not repeated on subsequent chunks.
- **No preflight signals.** ElevenLabsSTT does not emit preflight/eager end-of-turn events. If you need the eager LLM pipeline, use DeepgramFlux instead.
- **Session handshake.** The `connect()` call waits for a `session_started` message from the server before resolving, ensuring the session is fully initialized before audio streaming begins.
- **Auto-commit fallback.** Even in manual mode, the server auto-commits after 90 seconds of buffered audio.
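The base64 framing mentioned above looks roughly like this. A Node-flavoured sketch of the wire format, shown only for illustration since the provider performs this conversion for you (only the `audio_base_64` field name comes from the docs above):

```typescript
// Sketch of the base64 audio framing described above. The SDK does this
// automatically; this just illustrates what goes over the WebSocket.
// Uses Node's Buffer; in a browser the SDK encodes without it.
function toAudioMessage(chunk: ArrayBuffer): string {
  const audio_base_64 = Buffer.from(chunk).toString('base64');
  return JSON.stringify({ audio_base_64 });
}

// Four bytes of PCM silence encode to "AAAAAA==".
console.log(toAudioMessage(new ArrayBuffer(4))); // {"audio_base_64":"AAAAAA=="}
```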
## Related resources
- ElevenLabs STT example — standalone STT with proxy
- ElevenLabs pipeline example — full ElevenLabs STT + TTS pipeline
- Proxy server example — secure your API key server-side
- ElevenLabsTTS guide — ElevenLabs text-to-speech provider
- API reference: ElevenLabsSTT
- Providers reference