"There's a flourishing ecosystem of powerful, closed models but equally capable open models that are going to be coming over the next couple years." — Misha Laskin, March 25, 2026
That is very close to the idea I had in mind. Models like Kimi and Qwen already make that future feel visible. I wanted to build a tiny proof of it: a local LLM small enough to run in a browser tab, but impressive enough that people could feel what open models can already do for themselves.
I always wanted quick access to a local LLM right on my personal site — a small Easter egg that anyone can use for free, no account required. The challenge was finding the smallest model that could hold a real conversation and shipping it as a browser-native chatbot. No API keys, no cloud, no round trips. Everything runs in your browser.
I wanted visitors — especially engineers — to open the chat, ask something, and get a real answer back. From a model running on their own GPU. In a browser tab.
Why the smallest model
I could have picked a 3B or 7B model. The quality would be better. But I was optimizing for something else: speed, size, and reliability.
I started even smaller. Early versions of the assistant used SmolLM2-135M and then SmolLM2-360M. They were impressive for the size, but too brittle for the kind of conversational quality I wanted. Qwen2.5-0.5B ended up being the smallest model that still felt reliably capable.
A 0.5B model is a ~350 MB download. It loads fast. On my M3 MacBook Air it generates 40–55 tokens per second. It does not crash the tab or eat 4 GB of VRAM. For a widget tucked into the corner of a personal site, those constraints matter more than benchmark scores.
The model is Qwen2.5-0.5B-Instruct — the instruction-tuned variant, built specifically for chat. It handles structured prompts, follows system instructions, and generates clean conversational responses. Not bad for half a billion parameters.
How it works
WebLLM compiles the model to run on WebGPU with flash attention, paged KV caches, and FP16 shaders. The weights download once, get cached in the browser, and every visit after that loads from local storage. If you want the engine details, MLC has a good technical write-up here: WebLLM: A High-Performance In-Browser LLM Inference Engine.
Inference runs in a Web Worker so the UI stays responsive. Tokens stream in one by one, just like any cloud chatbot — except the GPU doing the work is yours.
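The worker-plus-streaming setup is small enough to show. This is a sketch, not my exact code: it assumes the `@mlc-ai/web-llm` package, the model id is WebLLM's prebuilt-style id for this model, and `render` is a placeholder for whatever updates your UI.

```typescript
// worker.ts — runs inference off the main thread.
// WebWorkerMLCEngineHandler forwards engine messages inside the worker.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);

// main.ts — the UI thread talks to a proxy engine backed by the worker.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", // prebuilt model id (illustrative)
  { initProgressCallback: (r) => render(r.text) }, // download / cache progress
);

const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? ""; // one token at a time
  render(reply); // placeholder UI update
}
```

The first call triggers the one-time weight download; subsequent visits resolve from the browser cache, so the same code path covers both cases.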
I shipped this one with WebLLM, but the broader browser-ML ecosystem deserves credit too. Transformers.js did a lot to make in-browser inference feel normal instead of niche, and the webml-community demos on Hugging Face are worth keeping around as a reference point for what browser-local WebGPU UX can look like in practice.
If I had to draw the line cleanly, I would put it like this: WebLLM is still the most ambitious, performance-oriented stack for chat-first local LLMs in the browser, while Transformers.js v4 is the more flexible and easier framework for getting new architectures working quickly. One feels like a specialized inference engine. The other feels like the broader browser AI standard library.
That distinction matters even more now that Gemma 4 already has a practical browser path through Hugging Face's ONNX + Transformers.js v4 stack, which they highlighted in their Gemma 4 launch post. If all I wanted was the fastest route to "Gemma 4 in the browser," I would probably start there. The reason I still care about the WebLLM route is different: I want to see whether a compiled MLC/WebLLM path can make the same model feel faster, tighter, and more native once it is fully working.
A model repo is not a browser runtime
One thing I had to learn the hard way is that "converting a model" does not mean training it again. It means translating it from the format the original authors published into the format your inference engine expects. In my case, that meant turning a normal Hugging Face model repo into something mlc-llm and WebLLM could actually execute inside a browser.
The easiest mental model is this: the converter is doing translation, compression, and packaging all at once.
First, you need a Python definition of the architecture itself: what layers exist, how tensors are shaped, and how the original weight names map into the new runtime. Then convert_weight takes the original checkpoint, remaps those tensors into the MLC layout, and usually quantizes them into something smaller and cheaper to run. Instead of one giant training checkpoint, you end up with browser-oriented parameter shards and a tensor-cache.json index.
After that, gen_config writes the runtime settings, and compile --device webgpu builds the actual WebGPU-targeted wasm library for that exact model. The final artifact is really a bundle: tokenizer files, converted weights, runtime config, and a compiled inference engine. That is why "the model compiles" and "the model actually works in a browser" are two different milestones. The first means the package exists. The second means the runtime survives real inference.
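Written out as commands, the pipeline looks roughly like this — the paths, quantization code, and output names here are illustrative, not the exact ones I used:

```shell
# 1. Translate + quantize the Hugging Face checkpoint into MLC parameter shards
mlc_llm convert_weight ./models/Qwen2.5-0.5B-Instruct \
  --quantization q4f16_1 \
  -o ./dist/Qwen2.5-0.5B-Instruct-q4f16_1-MLC

# 2. Write the runtime settings (tokenizer references, conversation template, context window)
mlc_llm gen_config ./models/Qwen2.5-0.5B-Instruct \
  --quantization q4f16_1 --conv-template qwen2 \
  -o ./dist/Qwen2.5-0.5B-Instruct-q4f16_1-MLC

# 3. Compile the WebGPU-targeted wasm library for this exact model
mlc_llm compile ./dist/Qwen2.5-0.5B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
  --device webgpu \
  -o ./dist/Qwen2.5-0.5B-Instruct-q4f16_1-MLC-webgpu.wasm
```

All three commands succeeding gets you to the first milestone. The second one — real inference in a real browser — you only reach by loading the bundle and chatting with it.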
Making a tiny model feel smart
A 0.5B model does not know much about the world. But it can follow instructions precisely if you show it what to do. The system prompt uses few-shot examples — real question-answer pairs in the exact tone and length I want. That turns out to be the highest-leverage optimization for small models. You teach by demonstration instead of hoping the model figures it out.
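The shape of that prompt is simple enough to show concretely. A sketch of building few-shot messages — the persona, examples, and wording here are illustrative, not my actual prompt:

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Question-answer pairs demonstrating the exact tone and length we want back.
const EXAMPLES: Array<[string, string]> = [
  ["What does this site run on?",
   "A 0.5B Qwen model compiled to WebGPU. Everything runs in your tab."],
  ["Can you write long essays?",
   "Not my strength. Ask me quick questions or for short rewrites."],
];

function buildMessages(userQuestion: string): ChatMessage[] {
  const messages: ChatMessage[] = [
    {
      role: "system",
      content:
        "You are a concise site assistant. Answer in at most two sentences, " +
        "plainly and honestly, in the style of the example turns.",
    },
  ];
  // Encode each example as a real user/assistant turn: small models imitate
  // demonstrated turns far more reliably than they follow prose instructions.
  for (const [q, a] of EXAMPLES) {
    messages.push({ role: "user", content: q });
    messages.push({ role: "assistant", content: a });
  }
  messages.push({ role: "user", content: userQuestion });
  return messages;
}
```

Putting the examples in as real conversation turns, rather than pasting them into the system string, is the part that does the teaching-by-demonstration.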
The rest is UX. Streaming makes responses feel instant. A good empty state with rotating example prompts tells people what the chatbot is actually good at. Honest framing — "best for quick questions, rewrites, and tiny experiments" — sets the right expectation so the model can meet it.
Why not go bigger
I could upgrade to the 1.5B or 3B variant with one config change. The architecture supports it. But the point was never to ship the smartest chatbot. It was to ship the smallest one that still works — and make it feel fast, stable, and native to the site.
That constraint forced better engineering decisions than having more parameters would have.
Tips if you want to build one
Pick an instruction-tuned model. Base models are trained to complete text, not answer questions. The -Instruct variants are fine-tuned for chat and will follow your system prompt much more reliably. This matters a lot at small sizes.
Write your system prompt like a spec. Small models do not infer what you want. Tell them the exact length, tone, and format. Include 3–5 example Q&A pairs that demonstrate the output you expect. This one change will do more for quality than upgrading the model.
Set honest expectations. If the chatbot says "best for quick questions and tiny experiments," visitors calibrate accordingly. A short, accurate answer feels better than a long, wrong one. Frame the capability before the first message.
Use a Web Worker. Model inference blocks the thread it runs on. Without a worker, your entire UI freezes during generation. WebLLM supports WebWorkerMLCEngine out of the box — use it.
Cache the model weights. The first download is the expensive part. After that, WebLLM caches everything in the browser. Make sure your loading state communicates this — "first load downloads ~350 MB once" sets the right expectation. Return visits should feel instant.
Stream everything. Token-by-token streaming makes a 40 tok/s model feel conversational. Buffered responses that appear all at once feel slow even if the total time is identical. Perceived speed is real speed.
Cut the response, do not stretch it. Small models start strong and degrade. Two clean sentences are better than four where the last two repeat the first. Build a sentence budget into your streaming logic and stop the model when it hits it.
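One way to implement that last tip: a small accumulator that counts sentence terminators in the streamed text and tells the caller when to stop generation. The terminator set and budget here are choices, not canon, and the period check is a naive heuristic (it will count abbreviations like "e.g." as sentence ends).

```typescript
// Accumulates streamed deltas and reports when the sentence budget is spent,
// so the caller can abort generation early instead of rendering filler.
class SentenceBudget {
  private text = "";
  private sentences = 0;

  constructor(private readonly maxSentences: number) {}

  // Feed one streamed delta; returns the text to render and whether to stop.
  push(delta: string): { text: string; done: boolean } {
    for (const ch of delta) {
      if (this.sentences >= this.maxSentences) break; // drop overflow tokens
      this.text += ch;
      if (ch === "." || ch === "!" || ch === "?") this.sentences += 1;
    }
    return { text: this.text, done: this.sentences >= this.maxSentences };
  }
}
```

In the streaming loop, call `push(delta)` on every chunk, render the returned text, and cancel the request once `done` is true — the model never gets the chance to repeat itself.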
Try it
The chatbot is in the bottom corner. Everything runs locally. Nothing leaves your device.