Turn-Taking
Control how the SDK manages microphone capture during agent speech — strategies for echo prevention and barge-in.
What is turn-taking?
When a voice agent speaks through the speakers, the microphone can pick up that audio and feed it back to the STT provider. The STT provider then transcribes the agent’s own speech, which triggers a new LLM request, which triggers more TTS playback — creating an infinite feedback loop.
Turn-taking is the system that prevents this. It controls whether the SDK pauses microphone capture while the agent is speaking, based on the provider combination, browser capabilities, and your configuration.
The pauseCaptureOnPlayback setting
The top-level control is pauseCaptureOnPlayback, which accepts three values:
| Value | Behavior |
|---|---|
| 'auto' | The SDK decides whether to pause based on the configured strategy and provider combination. |
| true | Always pause microphone capture during TTS playback. Prevents all echo. |
| false | Never pause microphone capture. Full-duplex mode, which requires reliable hardware echo cancellation. |
The default is 'auto'.
import { CompositeVoice, NativeSTT, AnthropicLLM, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new NativeSTT(),
  llm: new AnthropicLLM({ apiKey: 'sk-ant-...', model: 'claude-haiku-4-5-20251001' }),
  tts: new NativeTTS(),
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
  },
});
Auto strategies
When pauseCaptureOnPlayback is 'auto', the SDK uses the autoStrategy setting to decide. There are three strategies:
Conservative (default)
Pauses microphone capture unless the STT provider uses navigator.mediaDevices.getUserMedia(), which supports browser-level echo cancellation.
In practice, this means:
- NativeSTT (Web Speech API): always pauses, because the SpeechRecognition API has no echo cancellation support
- DeepgramSTT and AssemblyAISTT (MediaDevices): does not pause, because getUserMedia can enable echoCancellation: true
const agent = new CompositeVoice({
  stt, llm, tts,
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'conservative',
  },
});
This is the safest auto strategy and is the default.
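For context on what "getUserMedia can enable echoCancellation: true" means at the browser level, here is a minimal sketch of the standard constraint request. How DeepgramSTT or AssemblyAISTT open the microphone internally is not documented here, so treat this as an illustration only: `buildMicConstraints` and `openMicWithEchoCancellation` are hypothetical helper names, while `getUserMedia`, `getAudioTracks`, and `getSettings` are standard browser APIs.

```typescript
// Hypothetical helpers for illustration; not part of the SDK.
interface MicConstraints {
  audio: {
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
  };
  video: false;
}

// Builds the constraint object passed to getUserMedia().
function buildMicConstraints(): MicConstraints {
  return {
    audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
    video: false,
  };
}

// Browser-only: requests the microphone and reports whether the browser
// actually applied echo cancellation to the audio track.
async function openMicWithEchoCancellation(): Promise<boolean> {
  const mediaDevices = (globalThis as any).navigator?.mediaDevices;
  if (!mediaDevices?.getUserMedia) {
    return false; // not a browser, or no microphone API available
  }
  const stream = await mediaDevices.getUserMedia(buildMicConstraints());
  const [track] = stream.getAudioTracks();
  const applied = track?.getSettings().echoCancellation === true;
  // Release the device again; this sketch only probes the setting.
  stream.getTracks().forEach((t: { stop(): void }) => t.stop());
  return applied;
}
```

The SpeechRecognition API offers no equivalent hook for passing constraints, which is why NativeSTT is treated differently.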
Aggressive
Only pauses for provider combinations explicitly listed in alwaysPauseCombinations. All other combinations run in full-duplex mode.
const agent = new CompositeVoice({
  stt, llm, tts,
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'aggressive',
  },
});
With the default alwaysPauseCombinations, aggressive mode still pauses for every combination that includes NativeSTT, because NativeSTT is listed with a wildcard TTS match (see below). For WebSocket-based STT providers like DeepgramSTT, aggressive mode allows full-duplex.
Detect
Attempts to detect echo cancellation support at runtime by checking the browser’s navigator.mediaDevices.getSupportedConstraints() API. If the browser reports support for echoCancellation, noiseSuppression, and autoGainControl, the SDK allows full-duplex. Otherwise, it pauses.
const agent = new CompositeVoice({
  stt, llm, tts,
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'detect',
  },
});
The detect strategy checks two things:
- Whether the STT provider uses MediaDevices (SpeechRecognition API providers always get paused regardless of browser support)
- Whether the browser supports the required audio processing constraints
This is the most adaptive strategy, but note that it checks browser capability, not whether echo cancellation is actually working well with the user’s hardware. Laptops with poor speaker/microphone isolation may still produce echo even when the browser reports support.
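The browser capability check described above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's actual code: `browserSupportsEchoProcessing` and `readSupportedConstraints` are hypothetical names, while `navigator.mediaDevices.getSupportedConstraints()` is the standard browser API.

```typescript
// Shape of the dictionary returned by getSupportedConstraints(), reduced to
// what this sketch needs.
type SupportedConstraints = { [name: string]: boolean | undefined };

// The three constraints the detect strategy is described as requiring.
const REQUIRED_CONSTRAINTS = ['echoCancellation', 'noiseSuppression', 'autoGainControl'];

// Hypothetical helper: true only if every required constraint is reported as supported.
function browserSupportsEchoProcessing(supported: SupportedConstraints): boolean {
  return REQUIRED_CONSTRAINTS.every((name) => supported[name] === true);
}

// Reads the browser's report, or returns an empty object outside a browser.
function readSupportedConstraints(): SupportedConstraints {
  const mediaDevices = (globalThis as any).navigator?.mediaDevices;
  return mediaDevices?.getSupportedConstraints?.() ?? {};
}

// Full-duplex is only a candidate when the browser reports all three.
const fullDuplexCandidate = browserSupportsEchoProcessing(readSupportedConstraints());
```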
The alwaysPauseCombinations list
When using the 'aggressive' auto strategy, the SDK checks each STT/TTS provider pair against a list of known problematic combinations. If the combination matches, capture is paused regardless of the strategy.
The default list is:
alwaysPauseCombinations: [
  { stt: 'NativeSTT', tts: 'NativeTTS' },
  { stt: 'NativeSTT', tts: 'any' }, // NativeSTT always needs pause
]
The special value 'any' acts as a wildcard that matches any provider name. In the default configuration, every combination that uses NativeSTT will pause, because NativeSTT uses the Web Speech API which has no echo cancellation.
You can override this list to add your own known-bad combinations:
const agent = new CompositeVoice({
  stt, llm, tts,
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'aggressive',
    alwaysPauseCombinations: [
      { stt: 'NativeSTT', tts: 'any' },
      { stt: 'MyCustomSTT', tts: 'NativeTTS' },
    ],
  },
});
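To make the wildcard matching concrete, here is a rough sketch of how a provider pair could be checked against the list. It is not the SDK's implementation: `PauseCombination`, `matchesName`, and `isAlwaysPause` are hypothetical names, and the sketch assumes 'any' works as a wildcard in either field, although the examples above only show it in the tts position.

```typescript
// Hypothetical type mirroring the entries shown above.
interface PauseCombination {
  stt: string;
  tts: string;
}

// Copy of the default list documented above.
const DEFAULT_COMBINATIONS: PauseCombination[] = [
  { stt: 'NativeSTT', tts: 'NativeTTS' },
  { stt: 'NativeSTT', tts: 'any' },
];

// 'any' matches every provider name; anything else must match exactly.
function matchesName(pattern: string, providerName: string): boolean {
  return pattern === 'any' || pattern === providerName;
}

// True if the STT/TTS pair matches at least one entry in the list.
function isAlwaysPause(
  stt: string,
  tts: string,
  combinations: PauseCombination[] = DEFAULT_COMBINATIONS,
): boolean {
  return combinations.some((c) => matchesName(c.stt, stt) && matchesName(c.tts, tts));
}

// With the custom list from the example above:
const custom: PauseCombination[] = [
  { stt: 'NativeSTT', tts: 'any' },
  { stt: 'MyCustomSTT', tts: 'NativeTTS' },
];
const pausesCustomPair = isAlwaysPause('MyCustomSTT', 'NativeTTS', custom);
const pausesDeepgramPair = isAlwaysPause('DeepgramSTT', 'DeepgramTTS', custom);
```

Under these assumptions, `pausesCustomPair` is true and `pausesDeepgramPair` is false, so only the Deepgram pair would run full-duplex in aggressive mode.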
When to use each strategy
| Scenario | Recommended setting |
|---|---|
| Using NativeSTT (Web Speech API) | 'auto' with 'conservative' (default) — NativeSTT always needs pause |
| Using DeepgramSTT or AssemblyAISTT on a laptop | 'auto' with 'conservative' or 'detect' |
| Using DeepgramSTT with external speakers + good mic | 'auto' with 'aggressive' or pauseCaptureOnPlayback: false |
| Headphones (no echo possible) | pauseCaptureOnPlayback: false |
| Unsure about the user’s audio setup | pauseCaptureOnPlayback: true (always safe) |
| Kiosk or embedded device with known hardware | pauseCaptureOnPlayback: false after testing |
Barge-in behavior
Barge-in is the ability for the user to interrupt the agent while it is speaking. How barge-in works depends on the turn-taking configuration:
When pauseCaptureOnPlayback resolves to true: The microphone is paused during playback. The user cannot interrupt the agent by speaking — they must wait for the agent to finish. This prevents echo but disables natural interruption.
When pauseCaptureOnPlayback resolves to false: The microphone stays active during playback (full-duplex mode). If the user speaks while the agent is talking, the STT provider picks up their speech and fires transcription events. You can use these events to implement barge-in by calling stopSpeaking():
agent.on('transcription.interim', async ({ text }) => {
  if (agent.getState() === 'speaking' && text.trim().length > 0) {
    await agent.stopSpeaking();
  }
});
The stopSpeaking() method cancels TTS playback, disconnects any Live TTS WebSocket, and transitions the agent back to listening. The pipeline then processes the user’s new utterance normally.
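Interim transcription events can fire several times in quick succession, so it may be worth guarding the handler. The sketch below is one possible approach, not an SDK feature: `createBargeInHandler`, `BargeInAgent`, and the `minChars` threshold are hypothetical, and only `getState()` and `stopSpeaking()` are taken from the example above.

```typescript
// Structural subset of the agent used by this sketch.
interface BargeInAgent {
  getState(): string;
  stopSpeaking(): Promise<void>;
}

// Returns an interim-transcript handler that interrupts playback only when
// the agent is speaking, the transcript has at least `minChars` non-space
// characters, and no earlier stopSpeaking() call is still in flight.
// The handler resolves to true when it interrupted the agent.
function createBargeInHandler(agent: BargeInAgent, minChars = 1) {
  let stopping = false;
  return async ({ text }: { text: string }): Promise<boolean> => {
    if (stopping || agent.getState() !== 'speaking') return false;
    if (text.trim().length < minChars) return false;
    stopping = true;
    try {
      await agent.stopSpeaking();
      return true;
    } finally {
      stopping = false; // allow interruption on the next agent turn
    }
  };
}
```

Assuming the event callback accepts an async function as in the example above, this would be registered with `agent.on('transcription.interim', createBargeInHandler(agent, 3))`. A higher `minChars` trades responsiveness for fewer false interruptions from short noise transcripts.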
When pauseCaptureOnPlayback is 'auto' with 'conservative': Whether barge-in is available depends on the STT provider. With DeepgramSTT (which supports echo cancellation via MediaDevices), the microphone stays active and barge-in works. With NativeSTT, the microphone is paused and barge-in is not available.
Configuration examples
Default (recommended starting point):
const agent = new CompositeVoice({
  stt, llm, tts,
  // turnTaking is optional -- these are the defaults:
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'conservative',
    alwaysPauseCombinations: [
      { stt: 'NativeSTT', tts: 'NativeTTS' },
      { stt: 'NativeSTT', tts: 'any' },
    ],
  },
});
Production Deepgram pipeline with full-duplex:
import { CompositeVoice, DeepgramSTT, AnthropicLLM, DeepgramTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new DeepgramSTT({
    apiKey: 'your-deepgram-key',
    options: { model: 'nova-3', interimResults: true, endpointing: 300 },
  }),
  llm: new AnthropicLLM({
    apiKey: 'your-anthropic-key',
    model: 'claude-haiku-4-5-20251001',
    systemPrompt: 'You are a helpful voice assistant.',
    maxTokens: 200,
  }),
  tts: new DeepgramTTS({
    apiKey: 'your-deepgram-key',
    options: { model: 'aura-2-thalia-en', encoding: 'linear16', sampleRate: 24000 },
  }),
  turnTaking: {
    // full-duplex -- relies on browser echo cancellation, which DeepgramSTT can use via MediaDevices
    pauseCaptureOnPlayback: false,
  },
});

// Barge-in: stop the agent if the user starts speaking
agent.on('transcription.interim', async ({ text }) => {
  if (agent.getState() === 'speaking' && text.trim().length > 0) {
    await agent.stopSpeaking();
  }
});
Always-safe mode for unknown environments:
const agent = new CompositeVoice({
  stt, llm, tts,
  turnTaking: {
    pauseCaptureOnPlayback: true, // always pause -- no echo, no barge-in
  },
});
How the SDK decides
The decision flow for pauseCaptureOnPlayback is:
pauseCaptureOnPlayback === true?   → PAUSE (always)
pauseCaptureOnPlayback === false?  → CONTINUE (full-duplex)
pauseCaptureOnPlayback === 'auto'?
├─ autoStrategy: 'conservative'
│   ├─ STT uses MediaDevices with echo cancellation? → CONTINUE
│   └─ STT uses SpeechRecognition? → PAUSE
│
├─ autoStrategy: 'aggressive'
│   ├─ STT+TTS in alwaysPauseCombinations? → PAUSE
│   └─ otherwise → CONTINUE
│
└─ autoStrategy: 'detect'
    ├─ Browser supports echo cancellation constraints
    │   AND STT uses MediaDevices? → CONTINUE
    └─ otherwise → PAUSE
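The flow above can be condensed into a single function. This is a sketch of the documented decision table under stated assumptions, not the SDK's source: `shouldPauseCapture` and `TurnTakingContext` are hypothetical, and the three context flags stand in for facts the SDK works out internally. The conservative branch here keys only on whether the STT provider uses MediaDevices, which is a simplification of "MediaDevices with echo cancellation".

```typescript
type PauseSetting = 'auto' | boolean;
type AutoStrategy = 'conservative' | 'aggressive' | 'detect';

// Hypothetical inputs describing the current provider pair and browser.
interface TurnTakingContext {
  sttUsesMediaDevices: boolean;            // false for SpeechRecognition-based STT
  inAlwaysPauseCombinations: boolean;      // STT+TTS pair matched the list
  browserSupportsEchoConstraints: boolean; // result of the capability check
}

// Returns true when microphone capture should pause during playback.
function shouldPauseCapture(
  setting: PauseSetting,
  strategy: AutoStrategy,
  ctx: TurnTakingContext,
): boolean {
  // Explicit true/false bypasses every strategy.
  if (setting !== 'auto') return setting;

  if (strategy === 'conservative') {
    return !ctx.sttUsesMediaDevices;
  }
  if (strategy === 'aggressive') {
    return ctx.inAlwaysPauseCombinations;
  }
  // 'detect': continue only if both conditions hold.
  return !(ctx.sttUsesMediaDevices && ctx.browserSupportsEchoConstraints);
}

// Example contexts for two provider setups.
const nativeCtx: TurnTakingContext = {
  sttUsesMediaDevices: false,
  inAlwaysPauseCombinations: true,
  browserSupportsEchoConstraints: true,
};
const deepgramCtx: TurnTakingContext = {
  sttUsesMediaDevices: true,
  inAlwaysPauseCombinations: false,
  browserSupportsEchoConstraints: true,
};
```

With these contexts, every auto strategy pauses for `nativeCtx` and continues for `deepgramCtx`. The strategies only diverge in cases such as a MediaDevices-based STT in a browser that does not report the constraints, where 'detect' pauses while 'conservative' and 'aggressive' continue.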
The SDK logs its decision at the debug log level. Enable debug logging to see the reasoning:
const agent = new CompositeVoice({
  stt, llm, tts,
  logging: { enabled: true, level: 'debug' },
});
// Console: "Turn-taking: Auto mode with conservative strategy (DeepgramSTT + DeepgramTTS)"
// Console: "Turn-taking: Conservative - CONTINUE (DeepgramSTT uses mediadevices, echo cancellation: supported)"
Related
- Pipeline Architecture for how turn-taking fits into the overall voice pipeline
- Getting Started for basic agent configuration
- DeepgramSTT guide for WebSocket-based STT with echo cancellation support
- NativeSTT guide for the Web Speech API provider and its limitations