WebLLM (In-Browser)

Run LLMs entirely in the browser via WebGPU with the WebLLM provider.

Use WebLLMLLM when you need full privacy, offline capability, or zero server costs. WebLLM runs language models directly in the browser using WebGPU acceleration — no API key, no proxy, no network connection after the initial model download.

Prerequisites

  • A browser with WebGPU support (Chrome 113+, Edge 113+)
  • Sufficient GPU memory for the selected model
  • Install the peer dependency:
npm install @mlc-ai/web-llm

Basic setup

import { CompositeVoice, WebLLMLLM, NativeSTT, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  stt: new NativeSTT({ language: 'en-US' }),
  llm: new WebLLMLLM({
    model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
    systemPrompt: 'You are a concise voice assistant. Keep answers under two sentences.',
    onLoadProgress: ({ progress, text }) => {
      console.log(`Loading: ${Math.round(progress * 100)}% - ${text}`);
    },
  }),
  tts: new NativeTTS(),
});

await agent.start();

Configuration options

Option          Type      Default     Description
model           string    (required)  WebLLM model identifier. Must match a supported MLC model ID.
systemPrompt    string    –           System-level instructions for the assistant.
temperature     number    –           Randomness (0 = deterministic, 2 = creative).
maxTokens       number    –           Maximum tokens per response.
topP            number    –           Nucleus sampling threshold (0–1).
stream          boolean   true        Stream tokens incrementally.
onLoadProgress  function  –           Callback for model download and shader compilation progress.
chatOpts        object    –           Override MLC engine parameters (e.g., context_window_size).
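The sampling options have fixed valid ranges (temperature in 0–2, topP in 0–1), and out-of-range values typically surface as confusing engine errors. A small validation helper can catch them up front. This is a hypothetical sketch, not part of the library; the function name and option shape are assumptions based on the table above:

```typescript
// Hypothetical helper: validate sampling options before passing them to
// WebLLMLLM. Ranges follow the configuration table above.
interface SamplingOptions {
  temperature?: number;
  maxTokens?: number;
  topP?: number;
}

function validateSampling(opts: SamplingOptions): SamplingOptions {
  const { temperature, maxTokens, topP } = opts;
  if (temperature !== undefined && (temperature < 0 || temperature > 2)) {
    throw new RangeError(`temperature must be in [0, 2], got ${temperature}`);
  }
  if (topP !== undefined && (topP <= 0 || topP > 1)) {
    throw new RangeError(`topP must be in (0, 1], got ${topP}`);
  }
  if (maxTokens !== undefined && (!Number.isInteger(maxTokens) || maxTokens < 1)) {
    throw new RangeError(`maxTokens must be a positive integer, got ${maxTokens}`);
  }
  return opts;
}
```

You could call this on the options object just before constructing WebLLMLLM, so misconfiguration fails fast rather than mid-conversation.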

Model examples

Model                                  Size     Notes
Llama-3.2-1B-Instruct-q4f16_1-MLC      ~600 MB  Small, fast. Good for basic voice tasks.
Llama-3.2-3B-Instruct-q4f16_1-MLC      ~1.8 GB  Better quality, needs more GPU memory.
Phi-3.5-mini-instruct-q4f16_1-MLC      ~2 GB    Microsoft Phi, good instruction following.
Mistral-7B-Instruct-v0.3-q4f16_1-MLC   ~4 GB    Larger, higher quality.

See WebLLM’s model list for all supported models.
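One practical way to use the table is to pick a model from an estimated GPU memory budget. The sketch below is illustrative only: the thresholds are rough assumptions (runtime memory use exceeds the download size because of the KV cache and activations), and `pickModel` is not a library function:

```typescript
// Hypothetical sketch: choose a model ID from the table above based on an
// estimated GPU memory budget in megabytes. Thresholds are rough guesses.
function pickModel(gpuMemoryMB: number): string {
  if (gpuMemoryMB >= 8192) return 'Mistral-7B-Instruct-v0.3-q4f16_1-MLC';
  if (gpuMemoryMB >= 4096) return 'Phi-3.5-mini-instruct-q4f16_1-MLC';
  if (gpuMemoryMB >= 3072) return 'Llama-3.2-3B-Instruct-q4f16_1-MLC';
  return 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
}
```

In practice you would tune these cutoffs against the devices you actually target, since browsers do not expose exact VRAM figures.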

Complete example

import { CompositeVoice, WebLLMLLM, NativeSTT, NativeTTS } from '@lukeocodes/composite-voice';

const statusEl = document.getElementById('status')!;

const agent = new CompositeVoice({
  stt: new NativeSTT({ language: 'en-US' }),
  llm: new WebLLMLLM({
    model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
    systemPrompt: 'You are a helpful local assistant. Answer briefly.',
    onLoadProgress: ({ progress, text }) => {
      statusEl.textContent = `Loading model: ${Math.round(progress * 100)}% - ${text}`;
    },
    chatOpts: {
      context_window_size: 2048,
    },
  }),
  tts: new NativeTTS(),
});

statusEl.textContent = 'Initializing...';
await agent.start();
statusEl.textContent = 'Ready - speak now';

Tips

  • First load downloads the model. Weights range from roughly 600 MB to several gigabytes depending on the model, so the first visit can take minutes; the browser caches them for subsequent loads. Always wire onLoadProgress to a loading indicator.
  • Show a progress bar. The onLoadProgress callback provides progress (0–1), timeElapsed (seconds), and a human-readable text description.
  • WebGPU is required. If the user’s browser lacks WebGPU support, initialization fails. Check for navigator.gpu before creating the provider.
  • dispose() frees GPU memory. Call agent.stop() when the user leaves to release VRAM via engine.unload().
  • Abort uses engine.interruptGenerate(). Unlike server-side providers that cancel HTTP requests, WebLLM interrupts the local inference loop directly.
  • No API key or proxy needed. WebLLMLLM is the only LLM provider that runs without any server infrastructure.

© 2026 CompositeVoice. All rights reserved.
