Getting Started with WebLLM
Learn how to run large language models directly in the browser using WebGPU. No server required!
WebLLM enables running large language models directly in your browser using WebGPU acceleration. This means you can build AI-powered applications without sending data to external servers—everything runs locally on the user's device.
Prerequisites
Before we begin, make sure you have:
- A browser that supports WebGPU (Chrome 113+ or Edge 113+)
- Basic knowledge of JavaScript and async/await
- Node.js installed for the development setup
What is WebLLM?
WebLLM is a library developed by the MLC AI team that brings large language models to the browser. It uses:
- WebGPU: A new web standard for GPU-accelerated computation
- MLC-LLM: Machine Learning Compilation for efficient model execution
- Quantized Models: Compressed model weights that fit in browser memory
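The quantization scheme is visible in the model IDs used throughout this tutorial: in MLC's naming convention, a suffix like "q4f16_1" indicates 4-bit quantized weights with float16 activations. As a quick sketch (the parseQuantization helper below is hypothetical, written just to illustrate the naming):

```typescript
// Hypothetical helper: decode the quantization part of an MLC model ID.
// "q4f16_1" = 4-bit quantized weights, float16 activations (variant 1).
function parseQuantization(
  modelId: string
): { weightBits: number; activation: string } | null {
  const match = modelId.match(/q(\d+)f(\d+)/);
  if (!match) return null;
  return { weightBits: Number(match[1]), activation: `float${match[2]}` };
}

const info = parseQuantization("SmolLM2-360M-Instruct-q4f16_1-MLC");
// info -> { weightBits: 4, activation: "float16" }
```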
Step 1: Setting Up Your Project
First, create a new project and install WebLLM:
```bash
# Create a new Vite project
npm create vite@latest webllm-demo -- --template vanilla-ts
cd webllm-demo

# Install WebLLM
npm install @mlc-ai/web-llm
```
Step 2: Basic WebLLM Setup
Create a simple chat interface. In your main.ts:
```typescript
import * as webllm from "@mlc-ai/web-llm";

// Available models - smaller models load faster
const MODEL_ID = "SmolLM2-360M-Instruct-q4f16_1-MLC";

class WebLLMChat {
  private engine: webllm.MLCEngine | null = null;

  async initialize(onProgress: (progress: string) => void) {
    // Create the engine
    this.engine = new webllm.MLCEngine();

    // Set up progress callback
    this.engine.setInitProgressCallback((report) => {
      onProgress(`Loading: ${report.text} (${Math.round(report.progress * 100)}%)`);
    });

    // Load the model
    await this.engine.reload(MODEL_ID);
    onProgress("Ready!");
  }

  async chat(message: string): Promise<string> {
    if (!this.engine) throw new Error("Engine not initialized");

    const response = await this.engine.chat.completions.create({
      messages: [{ role: "user", content: message }],
      temperature: 0.7,
      max_tokens: 256,
    });

    return response.choices[0].message.content || "";
  }
}

// Usage
const chat = new WebLLMChat();

document.querySelector<HTMLDivElement>("#app")!.innerHTML = `
  <div>
    <h1>WebLLM Chat</h1>
    <div id="status">Initializing...</div>
    <input type="text" id="input" placeholder="Type a message..." disabled />
    <button id="send" disabled>Send</button>
    <div id="response"></div>
  </div>
`;

const statusEl = document.getElementById("status")!;
const inputEl = document.getElementById("input") as HTMLInputElement;
const sendBtn = document.getElementById("send") as HTMLButtonElement;
const responseEl = document.getElementById("response")!;

// Initialize
chat.initialize((status) => {
  statusEl.textContent = status;
  if (status === "Ready!") {
    inputEl.disabled = false;
    sendBtn.disabled = false;
  }
});

// Handle send
sendBtn.addEventListener("click", async () => {
  const message = inputEl.value;
  if (!message) return;

  sendBtn.disabled = true;
  responseEl.textContent = "Thinking...";

  const response = await chat.chat(message);
  responseEl.textContent = response;

  sendBtn.disabled = false;
  inputEl.value = "";
});
```
Step 3: Adding Streaming Responses
For a better user experience, stream the response token by token:
```typescript
// Add this method to the WebLLMChat class
async chatStream(
  message: string,
  onToken: (token: string) => void
): Promise<void> {
  if (!this.engine) throw new Error("Engine not initialized");

  const response = await this.engine.chat.completions.create({
    messages: [{ role: "user", content: message }],
    temperature: 0.7,
    max_tokens: 256,
    stream: true, // Enable streaming
  });

  // Iterate over the stream
  for await (const chunk of response) {
    const token = chunk.choices[0]?.delta?.content || "";
    if (token) {
      onToken(token);
    }
  }
}

// Usage with streaming
let fullResponse = "";
await chat.chatStream(message, (token) => {
  fullResponse += token;
  responseEl.textContent = fullResponse;
});
```
Step 4: Handling Model Loading
Model loading can take time (downloading ~200MB-2GB depending on the model). Here's how to provide good feedback:
```typescript
interface LoadingState {
  stage: "downloading" | "caching" | "initializing" | "ready";
  progress: number;
  text: string;
}

function parseProgress(report: webllm.InitProgressReport): LoadingState {
  const text = report.text.toLowerCase();

  if (text.includes("fetch")) {
    return {
      stage: "downloading",
      progress: report.progress,
      text: "Downloading model weights..."
    };
  }

  if (text.includes("cache")) {
    return {
      stage: "caching",
      progress: report.progress,
      text: "Caching model for faster loads..."
    };
  }

  if (text.includes("loading") || text.includes("init")) {
    return {
      stage: "initializing",
      progress: report.progress,
      text: "Initializing model..."
    };
  }

  return {
    stage: "ready",
    progress: 1,
    text: "Ready!"
  };
}
```
Step 5: Conversation Memory
For multi-turn conversations, maintain message history:
```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

class ConversationalChat {
  // initialize() works the same as in WebLLMChat above
  private engine: webllm.MLCEngine | null = null;
  private messages: Message[] = [];

  // Add system prompt (note: this also resets any existing history)
  setSystemPrompt(prompt: string) {
    this.messages = [{ role: "system", content: prompt }];
  }

  async chat(userMessage: string): Promise<string> {
    if (!this.engine) throw new Error("Engine not initialized");

    // Add user message to history
    this.messages.push({ role: "user", content: userMessage });

    // Get response with full history
    const response = await this.engine.chat.completions.create({
      messages: this.messages,
      temperature: 0.7,
      max_tokens: 256,
    });

    const assistantMessage = response.choices[0].message.content || "";

    // Add assistant response to history
    this.messages.push({ role: "assistant", content: assistantMessage });

    return assistantMessage;
  }

  clearHistory() {
    // Keep system prompt if present
    this.messages = this.messages.filter(m => m.role === "system");
  }
}
```
Available Models
WebLLM supports various models. Smaller models are faster but less capable:
| Model | Size | Use Case |
|---|---|---|
| SmolLM2-360M-Instruct | ~200MB | Quick responses, simple tasks |
| Llama-3.2-1B-Instruct | ~600MB | Better quality, still fast |
| Llama-3.2-3B-Instruct | ~1.5GB | Good balance of speed/quality |
| Phi-3.5-mini-instruct | ~2GB | Strong reasoning |
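One way to use the table above at runtime is to pick the largest model that fits a download budget. The sketch below assumes the sizes from the table and full model IDs following the tutorial's "-q4f16_1-MLC" naming; check the exact IDs against the current WebLLM model list before relying on them:

```typescript
// Sketch: choose the largest candidate model within a size budget.
// Sizes are the approximate download sizes from the table above (assumptions).
interface ModelOption {
  id: string;
  sizeMB: number;
}

const MODELS: ModelOption[] = [
  { id: "SmolLM2-360M-Instruct-q4f16_1-MLC", sizeMB: 200 },
  { id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", sizeMB: 600 },
  { id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", sizeMB: 1500 },
];

function pickModel(budgetMB: number): string | null {
  const fitting = MODELS.filter((m) => m.sizeMB <= budgetMB);
  if (fitting.length === 0) return null;
  // Largest model that still fits the budget
  return fitting.reduce((a, b) => (b.sizeMB > a.sizeMB ? b : a)).id;
}
```

You might then pass the result straight to engine.reload(), falling back to a smaller budget on failure.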
Common Issues
WebGPU Not Available
```typescript
async function checkWebGPU(): Promise<boolean> {
  if (!navigator.gpu) {
    console.error("WebGPU not supported in this browser");
    return false;
  }

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.error("No GPU adapter found");
    return false;
  }

  return true;
}
```
Memory Issues
If the model fails to load, try a smaller model or check available GPU memory.
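One defensive pattern is to try progressively smaller models until one loads. This is a sketch, not part of the WebLLM API: loadWithFallback is a hypothetical helper, and the load callback would typically wrap engine.reload():

```typescript
// Hypothetical helper: try each model ID in order until one loads.
// In practice, `load` would be something like (id) => engine.reload(id).
async function loadWithFallback(
  modelIds: string[],
  load: (id: string) => Promise<void>
): Promise<string> {
  for (const id of modelIds) {
    try {
      await load(id);
      return id; // this model loaded successfully
    } catch (err) {
      console.warn(`Failed to load ${id}, trying a smaller model...`, err);
    }
  }
  throw new Error("No model could be loaded");
}
```

Order the list largest-first so users with enough GPU memory get the best model, and everyone else degrades gracefully.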
Next Steps
Now that you have basic WebLLM working:
- Add a proper UI: Build a chat interface with message history
- Implement error handling: Handle network issues and GPU errors gracefully
- Explore other models: Try different models for your use case
- Add features: Implement copy-to-clipboard, message editing, etc.
In the next tutorial, we'll build a complete React chat interface with WebLLM.
This tutorial is part of the WebLLM Fundamentals series.