Why I Built a Tiny LLM Instead of Just Using One
For the past couple of years, I've been shipping production features built on top of GPT-4 and Claude. Retrieval pipelines, structured output extractors, multi-step agents — the usual toolkit. And for most of that time, I was comfortable treating the model as a black box with an API. You send tokens in, tokens come out, you iterate on your prompt until the output stops being embarrassing.
That worked fine, until it didn't.
I kept running into problems I couldn't reason about properly. Why did the model perform worse when I moved instructions from the system prompt to the user turn? Why did adding more examples to a prompt sometimes hurt accuracy? Why did retrieval quality degrade when I actually filled most of a long context window that was supposed to make things better? I was debugging prompts empirically — tweak, re-run, guess — without a real model of what was happening underneath.
So I did what I do when I feel like I'm operating machinery I don't understand: I built a small version of it myself.