Getting Started with WebLLM
Learn how to run large language models directly in the browser using WebGPU. No server required!
WebLLM enables running large language models directly in your browser using WebGPU acceleration. This means you can build AI-powered applications without sending data to external servers—everything runs locally on the user's device.
Prerequisites
Before we begin, make sure you have:
- A browser that supports WebGPU (Chrome 113+ or Edge 113+)
- Basic knowledge of JavaScript and async/await
- Node.js installed for the development setup
What is WebLLM?
WebLLM is a library developed by the MLC AI team that brings large language models to the browser. It uses:
- WebGPU: A new web standard for GPU-accelerated computation
- MLC-LLM: Machine Learning Compilation for efficient model execution
- Quantized Models: Compressed model weights that fit in browser memory
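The quantization scheme is visible in the model IDs used throughout this tutorial: in MLC's naming convention, a suffix like "q4f16_1" indicates 4-bit quantized weights with float16 activations. As a quick sketch (the parseQuantization helper below is hypothetical, written just to illustrate the naming):

```typescript
// Hypothetical helper: decode the quantization part of an MLC model ID.
// "q4f16_1" = 4-bit quantized weights, float16 activations (variant 1).
function parseQuantization(
  modelId: string
): { weightBits: number; activation: string } | null {
  const match = modelId.match(/q(\d+)f(\d+)/);
  if (!match) return null;
  return { weightBits: Number(match[1]), activation: `float${match[2]}` };
}

const info = parseQuantization("SmolLM2-360M-Instruct-q4f16_1-MLC");
// info -> { weightBits: 4, activation: "float16" }
```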
Step 1: Setting Up Your Project
First, create a new project and install WebLLM:
```bash
# Create a new Vite project
npm create vite@latest webllm-demo -- --template vanilla-ts
cd webllm-demo

# Install WebLLM
npm install @mlc-ai/web-llm
```
Step 2: Basic WebLLM Setup
Create a simple chat interface. In your main.ts:
```typescript
import * as webllm from "@mlc-ai/web-llm";

// Available models - smaller models load faster
const MODEL_ID = "SmolLM2-360M-Instruct-q4f16_1-MLC";

class WebLLMChat {
  private engine: webllm.MLCEngine | null = null;

  async initialize(onProgress: (progress: string) => void) {
    // Create the engine
    this.engine = new webllm.MLCEngine();

    // Set up progress callback
    this.engine.setInitProgressCallback((report) => {
      onProgress(`Loading: ${report.text} (${Math.round(report.progress * 100)}%)`);
    });

    // Load the model
    await this.engine.reload(MODEL_ID);
    onProgress("Ready!");
  }

  async chat(message: string): Promise<string> {
    if (!this.engine) throw new Error("Engine not initialized");

    const response = await this.engine.chat.completions.create({
      messages: [{ role: "user", content: message }],
      temperature: 0.7,
      max_tokens: 256,
    });

    return response.choices[0].message.content || "";
  }
}

// Usage
const chat = new WebLLMChat();

document.querySelector<HTMLDivElement>("#app")!.innerHTML = `
  <div>
    <h1>WebLLM Chat</h1>
    <div id="status">Initializing...</div>
    <input type="text" id="input" placeholder="Type a message..." disabled />
    <button id="send" disabled>Send</button>
    <div id="response"></div>
  </div>
`;

const statusEl = document.getElementById("status")!;
const inputEl = document.getElementById("input") as HTMLInputElement;
const sendBtn = document.getElementById("send") as HTMLButtonElement;
const responseEl = document.getElementById("response")!;

// Initialize
chat.initialize((status) => {
  statusEl.textContent = status;
  if (status === "Ready!") {
    inputEl.disabled = false;
    sendBtn.disabled = false;
  }
});

// Handle send
sendBtn.addEventListener("click", async () => {
  const message = inputEl.value;
  if (!message) return;

  sendBtn.disabled = true;
  responseEl.textContent = "Thinking...";

  const response = await chat.chat(message);
  responseEl.textContent = response;

  sendBtn.disabled = false;
  inputEl.value = "";
});
```
Step 3: Adding Streaming Responses
For a better user experience, stream the response token by token:
```typescript
// Add this method to the WebLLMChat class
async chatStream(
  message: string,
  onToken: (token: string) => void
): Promise<void> {
  if (!this.engine) throw new Error("Engine not initialized");

  const response = await this.engine.chat.completions.create({
    messages: [{ role: "user", content: message }],
    temperature: 0.7,
    max_tokens: 256,
    stream: true, // Enable streaming
  });

  // Iterate over the stream
  for await (const chunk of response) {
    const token = chunk.choices[0]?.delta?.content || "";
    if (token) {
      onToken(token);
    }
  }
}

// Usage with streaming
let fullResponse = "";
await chat.chatStream(message, (token) => {
  fullResponse += token;
  responseEl.textContent = fullResponse;
});
```
Step 4: Handling Model Loading
Model loading can take time (downloading ~200MB-2GB depending on the model). Here's how to provide good feedback:
```typescript
interface LoadingState {
  stage: "downloading" | "caching" | "initializing" | "ready";
  progress: number;
  text: string;
}

function parseProgress(report: webllm.InitProgressReport): LoadingState {
  const text = report.text.toLowerCase();

  if (text.includes("fetch")) {
    return {
      stage: "downloading",
      progress: report.progress,
      text: "Downloading model weights..."
    };
  }

  if (text.includes("cache")) {
    return {
      stage: "caching",
      progress: report.progress,
      text: "Caching model for faster loads..."
    };
  }

  if (text.includes("loading") || text.includes("init")) {
    return {
      stage: "initializing",
      progress: report.progress,
      text: "Initializing model..."
    };
  }

  return {
    stage: "ready",
    progress: 1,
    text: "Ready!"
  };
}
```
Step 5: Conversation Memory
For multi-turn conversations, maintain message history:
```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

class ConversationalChat {
  // initialize() works the same as in WebLLMChat above
  private engine: webllm.MLCEngine | null = null;
  private messages: Message[] = [];

  // Add system prompt (note: this also resets any existing history)
  setSystemPrompt(prompt: string) {
    this.messages = [{ role: "system", content: prompt }];
  }

  async chat(userMessage: string): Promise<string> {
    if (!this.engine) throw new Error("Engine not initialized");

    // Add user message to history
    this.messages.push({ role: "user", content: userMessage });

    // Get response with full history
    const response = await this.engine.chat.completions.create({
      messages: this.messages,
      temperature: 0.7,
      max_tokens: 256,
    });

    const assistantMessage = response.choices[0].message.content || "";

    // Add assistant response to history
    this.messages.push({ role: "assistant", content: assistantMessage });

    return assistantMessage;
  }

  clearHistory() {
    // Keep system prompt if present
    this.messages = this.messages.filter(m => m.role === "system");
  }
}
```
Available Models
WebLLM supports various models. Smaller models are faster but less capable:
| Model | Size | Use Case |
|---|---|---|
| SmolLM2-360M-Instruct | ~200MB | Quick responses, simple tasks |
| Llama-3.2-1B-Instruct | ~600MB | Better quality, still fast |
| Llama-3.2-3B-Instruct | ~1.5GB | Good balance of speed/quality |
| Phi-3.5-mini-instruct | ~2GB | Strong reasoning |
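One way to use the table above at runtime is to pick the largest model that fits a download budget. The sketch below assumes the sizes from the table and full model IDs following the tutorial's "-q4f16_1-MLC" naming; check the exact IDs against the current WebLLM model list before relying on them:

```typescript
// Sketch: choose the largest candidate model within a size budget.
// Sizes are the approximate download sizes from the table above (assumptions).
interface ModelOption {
  id: string;
  sizeMB: number;
}

const MODELS: ModelOption[] = [
  { id: "SmolLM2-360M-Instruct-q4f16_1-MLC", sizeMB: 200 },
  { id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", sizeMB: 600 },
  { id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", sizeMB: 1500 },
];

function pickModel(budgetMB: number): string | null {
  const fitting = MODELS.filter((m) => m.sizeMB <= budgetMB);
  if (fitting.length === 0) return null;
  // Largest model that still fits the budget
  return fitting.reduce((a, b) => (b.sizeMB > a.sizeMB ? b : a)).id;
}
```

You might then pass the result straight to engine.reload(), falling back to a smaller budget on failure.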
Common Issues
WebGPU Not Available
```typescript
async function checkWebGPU(): Promise<boolean> {
  if (!navigator.gpu) {
    console.error("WebGPU not supported in this browser");
    return false;
  }

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.error("No GPU adapter found");
    return false;
  }

  return true;
}
```
Memory Issues
If the model fails to load, try a smaller model or check available GPU memory.
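One defensive pattern is to try progressively smaller models until one loads. This is a sketch, not part of the WebLLM API: loadWithFallback is a hypothetical helper, and the load callback would typically wrap engine.reload():

```typescript
// Hypothetical helper: try each model ID in order until one loads.
// In practice, `load` would be something like (id) => engine.reload(id).
async function loadWithFallback(
  modelIds: string[],
  load: (id: string) => Promise<void>
): Promise<string> {
  for (const id of modelIds) {
    try {
      await load(id);
      return id; // this model loaded successfully
    } catch (err) {
      console.warn(`Failed to load ${id}, trying a smaller model...`, err);
    }
  }
  throw new Error("No model could be loaded");
}
```

Order the list largest-first so users with enough GPU memory get the best model, and everyone else degrades gracefully.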
Next Steps
Now that you have basic WebLLM working:
- Add a proper UI: Build a chat interface with message history
- Implement error handling: Handle network issues and GPU errors gracefully
- Explore other models: Try different models for your use case
- Add features: Implement copy-to-clipboard, message editing, etc.
In the next tutorial, we'll build a complete React chat interface with WebLLM.
This tutorial is part of the WebLLM Fundamentals series.