CartesiaTTS

Stream ultra-low-latency speech synthesis with Cartesia's Sonic models, emotion controls, and context-based streaming.

Use CartesiaTTS when streaming latency matters most. Cartesia's Sonic models deliver fast time-to-first-byte, and the context-based streaming protocol links multiple text chunks into a single coherent utterance with consistent prosody.

Basic setup

import { CompositeVoice, DeepgramSTT, AnthropicLLM, CartesiaTTS } from '@lukeocodes/composite-voice';

const voice = new CompositeVoice({
  stt: new DeepgramSTT({ proxyUrl: '/api/proxy/deepgram' }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
  }),
  tts: new CartesiaTTS({
    proxyUrl: '/api/proxy/cartesia',
    voiceId: 'a0e99841-438c-4a64-b679-ae501e7d6091',
    modelId: 'sonic-2',
    outputEncoding: 'pcm_s16le',
    outputSampleRate: 24000,
  }),
});

await voice.start();

Configuration options

Option            Type      Default        Description
apiKey            string    —              Cartesia API key (direct mode)
proxyUrl          string    —              Proxy server URL (recommended for production)
voiceId           string    (required)     Cartesia voice ID
modelId           string    'sonic-2'      Synthesis model
language          string    'en'           BCP 47 language code
outputEncoding    string    'pcm_s16le'    Audio encoding format
outputSampleRate  number    16000          Sample rate in Hz
speed             number    —              Speech speed multiplier (>1 faster, <1 slower)
emotion           string[]  —              Emotion tags for voice expression
cartesiaVersion   string    '2024-06-10'   API version string
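For local prototyping you can pass apiKey instead of proxyUrl. A minimal sketch (the environment variable name is our convention, not the SDK's; never ship a raw key to the browser):

```typescript
import { CartesiaTTS } from '@lukeocodes/composite-voice';

// Direct mode: the key travels with the client, so keep this to local
// development. In production, route requests through proxyUrl instead.
const tts = new CartesiaTTS({
  apiKey: process.env.CARTESIA_API_KEY!,
  voiceId: 'a0e99841-438c-4a64-b679-ae501e7d6091',
});
```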

Models

Model               Description
sonic-2             Latest model, best quality and speed (default)
sonic               Previous-generation model
sonic-multilingual  Multi-language support

Output encodings

Encoding    Description
pcm_s16le   16-bit signed little-endian PCM (default)
pcm_f32le   32-bit float little-endian PCM
pcm_mulaw   μ-law PCM (telephony)
pcm_alaw    A-law PCM (telephony)
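Encoding and sample rate together determine the raw audio bandwidth your app receives. A quick back-of-envelope helper (ours, not part of the SDK) for mono streams:

```typescript
// Bytes per sample for each Cartesia output encoding.
const BYTES_PER_SAMPLE: Record<string, number> = {
  pcm_s16le: 2, // 16-bit signed integer
  pcm_f32le: 4, // 32-bit float
  pcm_mulaw: 1, // 8-bit companded (telephony)
  pcm_alaw: 1,  // 8-bit companded (telephony)
};

// Uncompressed audio bandwidth in bytes per second for a mono stream.
function bytesPerSecond(encoding: string, sampleRate: number): number {
  const bps = BYTES_PER_SAMPLE[encoding];
  if (bps === undefined) throw new Error(`Unknown encoding: ${encoding}`);
  return bps * sampleRate;
}

console.log(bytesPerSecond('pcm_s16le', 24000)); // 48000 bytes/s
console.log(bytesPerSecond('pcm_mulaw', 8000));  // 8000 bytes/s
```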

Complete example

import { CompositeVoice, DeepgramSTT, AnthropicLLM, CartesiaTTS } from '@lukeocodes/composite-voice';

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'a0e99841-438c-4a64-b679-ae501e7d6091',
  modelId: 'sonic-2',
  language: 'en',
  outputEncoding: 'pcm_s16le',
  outputSampleRate: 24000,
  emotion: ['positivity:high', 'curiosity'],
});

const voice = new CompositeVoice({
  stt: new DeepgramSTT({ proxyUrl: '/api/proxy/deepgram' }),
  llm: new AnthropicLLM({
    proxyUrl: '/api/proxy/anthropic',
    model: 'claude-haiku-4-5',
  }),
  tts,
  logging: { enabled: true, level: 'debug' },
});

voice.on('tts:start', () => console.log('Speaking...'));
voice.on('tts:end', () => console.log('Done speaking'));

await voice.start();

Emotion tags

Cartesia supports emotion controls that shape the voice’s expressiveness. Pass an array of tags to the emotion option.

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'a0e99841-438c-4a64-b679-ae501e7d6091',
  emotion: ['positivity:high', 'curiosity'],
});

Tags follow the format emotion_name or emotion_name:intensity. Examples: 'positivity:high', 'curiosity', 'anger:low', 'surprise'. Combine multiple tags to create nuanced expressions.
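If you build tag lists dynamically, it can help to sanity-check their shape before sending them. A hypothetical helper (not part of the SDK) that validates the name / name:intensity format described above:

```typescript
// Checks that an emotion tag matches "emotion_name" or
// "emotion_name:intensity". The shape comes from the docs above; the
// accepted vocabulary of names and intensities is defined by Cartesia,
// so this only guards against malformed strings.
function isValidEmotionTag(tag: string): boolean {
  return /^[a-z_]+(?::[a-z]+)?$/.test(tag);
}

console.assert(isValidEmotionTag('curiosity'));       // bare tag
console.assert(isValidEmotionTag('positivity:high')); // tag with intensity
console.assert(!isValidEmotionTag('anger:'));         // trailing colon, invalid
```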

Context-based streaming

CartesiaTTS uses a context_id to link multiple text chunks into a single coherent utterance. The first chunk starts a new context, and subsequent chunks continue it with continue: true. This preserves prosody and intonation across chunk boundaries.

The SDK handles context management automatically. For standalone use:

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'a0e99841-438c-4a64-b679-ae501e7d6091',
});

await tts.initialize();
await tts.connect();       // Opens WebSocket, generates context ID

tts.onAudio((chunk) => {
  // chunk.data contains raw PCM audio
});

tts.sendText('Hello, ');   // First chunk: continue = false
tts.sendText('world!');    // Continuation: continue = true
await tts.finalize();      // Sends end-of-input, resets context
await tts.disconnect();

After finalize(), the context resets. The next sendText() call starts a fresh utterance with a new context ID.
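Conceptually, each outgoing message pairs the text chunk with the shared context ID and a continue flag. A simplified sketch of that pairing (the field names follow Cartesia's streaming API; the helper and types are illustrative, not the SDK's internals):

```typescript
// One context-linked message: the text chunk, the shared context ID, and
// whether this chunk continues an existing context.
interface StreamChunk {
  transcript: string;
  context_id: string;
  continue: boolean;
}

// Pair each text chunk with the context: the first chunk starts the
// context (continue = false), every later chunk continues it.
function buildChunks(contextId: string, texts: string[]): StreamChunk[] {
  return texts.map((transcript, i) => ({
    transcript,
    context_id: contextId,
    continue: i > 0,
  }));
}

const chunks = buildChunks('ctx-1', ['Hello, ', 'world!']);
// chunks[0].continue === false, chunks[1].continue === true
```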

Tips

  • Use sonic-2 for the best combination of latency and quality. Use sonic-multilingual for non-English languages.
  • The speed option accepts any positive number. Values above 1 speed up speech; values below 1 slow it down.
  • Combine emotion tags for richer expression — e.g., ['positivity:high', 'curiosity'] for an enthusiastic, inquisitive tone.
  • For telephony, use pcm_mulaw or pcm_alaw encoding with an 8000 Hz sample rate.
  • Cartesia emits word-level timestamp events during streaming. These arrive as metadata alongside audio chunks.
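Putting the telephony tip into a configuration sketch (the voice ID is a placeholder; substitute your own):

```typescript
import { CartesiaTTS } from '@lukeocodes/composite-voice';

// Telephony-oriented setup: 8-bit μ-law at 8000 Hz, the standard
// combination for phone networks.
const telephonyTts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'your-voice-id',
  outputEncoding: 'pcm_mulaw',
  outputSampleRate: 8000,
});
```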

© 2026 CompositeVoice. All rights reserved.
