From Tokens to Open Weight: What You Actually Need to Know About LLMs
Or: How I learned to stop worrying about the marketing and understand the machine
You've heard the terms. Parameters. Tokens. Context
windows. Agentic scores. Open-weight. They get thrown around in every AI
headline, but few people stop to explain what they actually mean — and more
importantly, how they work together.
This post distills a recent deep-dive session into a clear,
practical guide. No fluff. Just the mental models you need to make real
decisions about which models to use, what hardware you need, and what
"open" actually means.
Part 1: The Two Things Everything Else Depends On
Tokens: The Currency of Language
A token is how a model sees text. It's
rarely a whole word — more often a piece of a word, a space, or a punctuation
mark.
Example sentence: "The cat is sitting on the mat?"
That's 8–9 tokens, depending on the tokenizer:
- "The" (1)
- " cat" (1)
- " is" (1)
- " sitting" (1) — or " sit" + "ting" (2)
- " on" (1)
- " the" (1)
- " mat" (1)
- "?" (1)
Tokens are what you pay for with API calls. They're what
context windows measure. They're the raw material.
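To make the counting concrete, here is a minimal sketch of a toy tokenizer. It is not how production tokenizers work (those learn their pieces from data, typically with BPE or SentencePiece); the regex, the function names, and the price argument are all assumptions made up for illustration.

```python
import re

# Toy splitter: an optional leading space plus a run of word characters,
# or a single punctuation mark. Real tokenizers learn their pieces from
# data and will split some words further (" sit" + "ting").
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def toy_tokenize(text: str) -> list:
    """Split text into rough token-like pieces."""
    return TOKEN_PATTERN.findall(text)


def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of processing num_tokens at an assumed per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


tokens = toy_tokenize("The cat is sitting on the mat?")
print(tokens)
print(len(tokens))  # 8
```

For real numbers, count with the tokenizer that ships with the model you plan to use, since counts differ between tokenizers.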
Parameters: The Memory That Doesn't Forget
A parameter is a learned number inside the
neural network — a weight or bias that determines how input (tokens) transforms
into output (answers).
GPT-3: 175 billion parameters.
LLaMA 70B: 70 billion parameters.
Your fine-tuned model: whatever you can fit on your GPU.
Crucially: Parameters are fixed after training.
They don't change when you prompt the model. They are the model's permanent
knowledge. Tokens are the temporary question.
| | Tokens | Parameters |
|---|---|---|
| What is it? | Pieces of text | Learned numbers |
| Changes per prompt? | Yes | No |
| How many in a typical query? | Hundreds to thousands | Billions (fixed) |
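If "a learned number" still feels abstract, counting them helps. The sketch below counts weights and biases in a tiny fully connected network. It is a toy, not a transformer, and the layer sizes are arbitrary values chosen for illustration.

```python
def dense_layer_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes: list) -> int:
    """Total parameters for a stack of dense layers, e.g. [512, 2048, 512]."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


total = mlp_params([512, 2048, 512])
print(total)  # 2099712
```

Two small layers already hold about 2.1 million parameters. Stack dozens of much wider layers and the billions follow quickly.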
Part 2: The Hardware Reality Check
Here's where theory meets physics. Parameters don't float
in the cloud — they live on GPUs, and GPUs have limits.
The Memory Wall
A 70-billion-parameter model stored in 16-bit precision (2 bytes per
parameter) needs 140 GB of GPU memory just for the parameters. Add
attention overhead and key-value caches, and you're easily at 200+ GB.
That's why you can't run LLaMA 70B on a single consumer GPU
(which typically tops out at 24 GB). You need:
- Multiple enterprise GPUs (4× A100 at ~$40k total)
- Or quantization (squeezing parameters into 4 bits = 35 GB)
- Or smaller models (7B parameters = 14 GB, fits on an RTX 4090)
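The figures above are plain arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only; the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


configs = [
    ("70B @ 16-bit", 70e9, 16),
    ("70B @ 4-bit", 70e9, 4),
    ("7B @ 16-bit", 7e9, 16),
]
for name, params, bits in configs:
    print(f"{name}: {weight_memory_gb(params, bits):.0f} GB")
```

This covers the weights only. Leave headroom for activations and the key-value cache before deciding a model fits a given card.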
The Context Length Trap
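A 128K context window sounds amazing: 300+ pages in one
pass. But that 128K isn't free.
Naive attention scales as O(n²) in sequence length. At 100K tokens, a
single 16-bit attention score matrix holds 10 billion entries, about
20 GB, on top of your parameters. Memory-efficient attention kernels
avoid materializing that matrix, but compute still grows quadratically
and the key-value cache grows with every token. This is why many models
advertise 128K but slow to a crawl, or crash, before you reach it.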
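Where does a number like 20 GB come from? A rough sketch: one n × n score matrix in 16-bit precision, plus the key-value cache for comparison. The layer count, KV head count, and head dimension below are assumed values for a hypothetical 70B-class model with grouped-query attention, not the specs of any particular release.

```python
def attention_scores_gb(seq_len: int, bytes_per_value: int = 2) -> float:
    """Size of one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value / 1e9


def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Keys and values cached for every layer; grows linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9


print(f"{attention_scores_gb(100_000):.1f} GB")  # 20.0 GB
# Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128.
print(f"{kv_cache_gb(100_000, n_layers=80, n_kv_heads=8, head_dim=128):.1f} GB")
```

The quadratic term is what naive implementations pay. The linear KV cache is what even well-optimized serving stacks still pay.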
Practical advice: Don't chase context length without checking whether
the model actually maintains coherence at that length, especially for languages
other than English (more on that below).
Part 3: Agentic Scores - How Independent Is the Model?
A model's agentic score measures how well
it can act autonomously: plan, use tools, recover from errors, remember goals
across steps.
This scales (roughly) with parameter count, but not
linearly:
| Size Class | Parameters | Agentic Score (0-100) | What it can do autonomously |
|---|---|---|---|
| Tiny | <1B | 5-15 | Single keyword extraction |
| Small | 1-7B | 20-35 | 1-2 step instructions |
| Medium | 7-30B | 40-60 | Multi-step planning, tool use |
| Large | 30-100B | 55-75 | Long-horizon, self-correction |
| Frontier | 100B-2T | 70-95 | Sub-agent spawning, novel strategy |
Real example: A small model (7B) needs explicit step‑by‑step
prompts. A large model (70B) can handle "Plan my week's runs around the
weather, and if it rains, reschedule intelligently" without hand‑holding.
Choose your model based on how much autonomy you actually
need. For a simple FAQ bot, small is fine. For an AI research assistant that
writes papers, you want large or frontier.
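One way to use the table is as a lookup: given the autonomy you need, find the smallest size class that plausibly reaches it. The sketch below copies the rough score ranges from this post. They are not a standardized benchmark, and the function name and the `conservative` flag are assumptions for illustration.

```python
from typing import Optional

# (size class, parameter range, (low, high) agentic score), as listed in this post.
SIZE_CLASSES = [
    ("Tiny", "<1B", (5, 15)),
    ("Small", "1-7B", (20, 35)),
    ("Medium", "7-30B", (40, 60)),
    ("Large", "30-100B", (55, 75)),
    ("Frontier", "100B-2T", (70, 95)),
]


def smallest_class_for(min_score: int, conservative: bool = False) -> Optional[str]:
    """Return the smallest size class that can reach min_score.

    By default the top of each range counts (the best model in the class);
    with conservative=True the bottom of the range must reach min_score.
    """
    for name, _params, (low, high) in SIZE_CLASSES:
        reachable = low if conservative else high
        if reachable >= min_score:
            return name
    return None


print(smallest_class_for(30))                     # Small
print(smallest_class_for(70))                     # Large
print(smallest_class_for(70, conservative=True))  # Frontier
```

The ranges overlap, so treat the result as a starting point for evaluation rather than a verdict.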
Part 4: 128K Context and the Language Problem
A 128K context window means nothing if the model loses its
grasp of your language halfway through.
Research shows that many multilingual models suffer
from language decay over long contexts:
| Position in 128K | English Accuracy | Low‑Resource Language Accuracy |
|---|---|---|
| Start (0-10K) | 95% | 85% |
| Middle (40-60K) | 92% | 65% ↓ |
| End (110-128K) | 85% | 25% ↓↓↓ |
What this means for you: If you're processing a 100‑page
legal contract in Spanish, or a novel in Hindi, or a technical manual in
Japanese, a 128K window only helps if the model was actually trained to handle
that language at that distance.
Ask before you adopt: What is the model's coverage
score for my language? How does it degrade over distance? Does the provider
publish multilingual benchmarks?
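If the provider doesn't publish those numbers, you can measure the trend yourself: plant questions at known positions in long documents in your language, record whether the model answers correctly, and bucket the results by position. A minimal sketch of the bucketing step follows. The record format and function name are assumptions, and the sample records are synthetic placeholders, not measurements.

```python
from collections import defaultdict


def accuracy_by_position(results, bucket_size=10_000):
    """results: iterable of (token_position, is_correct) pairs.

    Returns {bucket_start: accuracy}, ordered by position in the context.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, is_correct in results:
        bucket = (position // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Synthetic records for illustration only.
sample = [
    (500, True), (9_000, True),
    (45_000, True), (47_000, False),
    (120_000, False), (125_000, False),
]
print(accuracy_by_position(sample))  # {0: 1.0, 40000: 0.5, 120000: 0.0}
```

Run the same test in English and in your target language, and compare the two curves rather than the absolute numbers.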
Part 5: Open‑Weight - What You Actually Get
Open‑weight means the trained parameter files are released for
you to download, run, fine‑tune, and deploy.
It is not open‑source. You rarely get the training
code, almost never get the training data, and cannot reproduce the model from
scratch.
What you can do with open‑weight models:
- Run locally (privacy, no API fees)
- Fine‑tune on your data (medical, legal, customer support)
- Quantize to fit on cheaper hardware
- Deploy commercially without per‑token costs, if the license allows it
- Audit for safety and bias
What you cannot do:
- Reproduce the model from scratch
- Know exactly what was in the training data (copyrighted books? private conversations?)
- Fix fundamental architectural bugs (no training code)
Hardware trade‑off:
| Model Size | Memory (16‑bit) | Minimum Hardware |
|---|---|---|
| 7B | 14 GB | RTX 4090 ($1600) |
| 13B | 26 GB | 2× consumer GPUs or cloud instance |
| 70B | 140 GB | 4× A100 ($40k) or quantize aggressively |
Open‑weight doesn't mean free — it means you pay for
hardware instead of paying per API call.
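A quick way to see which side of that trade you're on is a break-even estimate. The sketch below is deliberately crude: it ignores electricity, hosting, and your own time. The $1600 figure is the RTX 4090 price quoted above, while the API price is a hypothetical placeholder you should replace with your provider's actual rate.

```python
def break_even_tokens(hardware_usd: float, api_usd_per_million_tokens: float) -> float:
    """Tokens you must process before owned hardware beats the API on price.

    Ignores electricity, hosting, and engineering time.
    """
    return hardware_usd / api_usd_per_million_tokens * 1_000_000


# $1600 GPU versus a hypothetical API price of $1.00 per million tokens.
print(f"{break_even_tokens(1600, 1.0):,.0f} tokens")  # 1,600,000,000 tokens
```

If your workload is nowhere near the break-even volume, the API is probably the cheaper choice unless privacy or fine-tuning decides the matter.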
Part 6: Putting It All Together - A Decision Framework
When you evaluate a model, ask these questions in order:
- Tokens: How many do I need per query? (cost, latency)
- Parameters: Does the model size match my task complexity? (Don't use 70B for sentiment analysis.)
- Hardware: Can I actually run this? (Memory + compute)
- Agentic score: How much autonomy do I need? (Multi‑step tool use, or just Q&A?)
- Context + language: Does the model maintain accuracy for my language across the entire context window?
- Open‑weight vs API: Do I need privacy, fine‑tuning, and predictable costs? Or do I want convenience and zero hardware management?
Example decisions:
- Simple chatbot for an English website: Small open‑weight model (7B) on a single GPU. Fine‑tune on your support tickets. Total cost: $2000 hardware + 10 hours.
- Multilingual legal review (English + Spanish + French, 80‑page contracts): API model with proven 128K+ multilingual coverage (GPT‑4 Turbo or Gemini 1.5 Pro). Pay per document. No hardware headaches.
- Research assistant that plans, executes, and reports: Large open‑weight model (70B) on cloud GPUs. Fine‑tune for your domain. Agentic score 70+.
- Mobile app with privacy requirements: Tiny open‑weight model (<1B) running on device. Low agentic score, but zero data leaving the phone.
The Bottom Line
You don't need frontier models for most tasks. You don't
need 128K context for most documents. And you definitely don't need to run 70B
parameters locally if an API call costs less than your time debugging GPU
drivers.
But when you do need control, privacy, or fine‑tuning —
open‑weight models are the only real answer. Just be honest about the hardware
they demand and the language coverage they actually deliver.
And always remember: tokens are what you ask. Parameters
are what knows. Hardware is what pays. Choose accordingly.
This post was synthesized from a technical Q&A session.
If you want to go deeper on any of these topics — quantization, agentic
evaluation, or multilingual benchmarks — the conversation is open.