What You Need to Know About LLMs

From Tokens to Open Weight: What You Actually Need to Know About LLMs

Or: How I learned to stop worrying about the marketing and understand the machine

You've heard the terms. Parameters. Tokens. Context windows. Agentic scores. Open-weight. They get thrown around in every AI headline, but few people stop to explain what they actually mean — and more importantly, how they work together.

This post distills a recent deep-dive session into a clear, practical guide. No fluff. Just the mental models you need to make real decisions about which models to use, what hardware you need, and what "open" actually means.

Part 1: The Two Things Everything Else Depends On

Tokens: The Currency of Language

A token is the unit of text a model actually sees. It isn't always a whole word: common words are often a single token, while longer or rarer words get split into pieces, and spaces and punctuation marks count too.

Example sentence: "The cast is sitting on the mat?"

That's 8–9 tokens, depending on the tokenizer:

"The" (1)
" cast" (1)
" is" (1)
" sitting" (1) — or " sit" + "ting" (2)
" on" (1)
" the" (1)
" mat" (1)
"?" (1)

Tokens are what you pay for with API calls. They're what context windows measure. They're the raw material.
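To make the counting concrete, here is a minimal sketch of a toy splitter that reproduces the 8-token breakdown above. It is not a real tokenizer: production models use learned vocabularies (BPE, SentencePiece and similar), and the `toy_tokenize` and `estimate_cost` names, along with any price you pass in, are assumptions for illustration.

```python
import re

# Toy splitter: a word with its optional leading space, or one punctuation mark.
# Real tokenizers learn their vocabulary from data and may split a word
# like " sitting" into smaller pieces.
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def toy_tokenize(text: str) -> list[str]:
    """Split text into rough, token-like pieces (illustration only)."""
    return TOKEN_PATTERN.findall(text)


def estimate_cost(n_tokens: int, usd_per_million_tokens: float) -> float:
    """API cost for n_tokens at a hypothetical per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


tokens = toy_tokenize("The cast is sitting on the mat?")
print(tokens)       # ['The', ' cast', ' is', ' sitting', ' on', ' the', ' mat', '?']
print(len(tokens))  # 8
```

For real billing estimates, count tokens with the tokenizer your provider actually uses, since counts differ between models.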

Parameters: The Memory That Doesn't Forget

A parameter is a learned number inside the neural network — a weight or bias that determines how input (tokens) transforms into output (answers).

GPT-3: 175 billion parameters.
LLaMA 70B: 70 billion parameters.
Your fine-tuned model: whatever you can fit on your GPU.

Crucially: Parameters are fixed after training. They don't change when you prompt the model. They are the model's permanent knowledge. Tokens are the temporary question.
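If "billions of parameters" feels abstract, counting them for a tiny network helps. The sketch below assumes plain dense layers, which is a simplification: real LLMs are transformers with attention and embedding layers, but the idea of adding up every weight and bias is the same. The layer sizes are hypothetical.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def mlp_params(layer_sizes: list[int]) -> int:
    """Total parameters in a stack of dense layers, e.g. [4, 8, 2]."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical toy network: 4 inputs -> 8 hidden units -> 2 outputs.
print(mlp_params([4, 8, 2]))  # (4*8 + 8) + (8*2 + 2) = 58
```

The count depends only on the network's shape, not on the prompt, which is why it stays fixed at inference time.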

	Tokens	Parameters
What is it?	Pieces of text	Learned numbers
Changes per prompt?	Yes	No
How many in a typical query?	Hundreds to thousands	Billions (fixed)

Part 2: The Hardware Reality Check

Here's where theory meets physics. Parameters don't float in the cloud — they live on GPUs, and GPUs have limits.

The Memory Wall

A 70 billion parameter model stored in 16-bit precision (2 bytes per parameter) needs 140 GB of GPU memory, just for the parameters. Add the key-value cache and activation memory, and long contexts or large batches can push the total past 200 GB.

That's why you can't run LLaMA 70B at 16-bit precision on a single consumer GPU (even high-end cards typically offer 24 to 32 GB). You need:

Multiple enterprise GPUs (4× A100 at ~$40k total)
Or quantization (squeezing parameters into 4 bits = 35GB)
Or smaller models (7B parameters = 14GB, fits on an RTX 4090)
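The memory figures above are simple arithmetic: parameter count times bytes per parameter. A minimal sketch, using decimal gigabytes and counting weights only (no key-value cache or activations); the function name is an assumption for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params, bits in [
    ("70B @ 16-bit", 70e9, 16),
    ("70B @ 4-bit", 70e9, 4),
    ("7B @ 16-bit", 7e9, 16),
]:
    print(f"{name}: {weight_memory_gb(n_params, bits):.0f} GB")
# 70B @ 16-bit: 140 GB
# 70B @ 4-bit: 35 GB
# 7B @ 16-bit: 14 GB
```

Treat these as lower bounds: real quantization formats store extra scaling data, and inference needs working memory on top of the weights.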

The Context Length Trap

A 128K context window sounds amazing — 300+ pages in one pass. But that 128K isn't free.

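Naive attention scales as O(n²) in sequence length. At 100K tokens, a single fully materialized attention score matrix in 16-bit precision takes about 20 GB, per head and per layer. Memory-efficient kernels such as FlashAttention avoid storing that matrix, but compute still grows quadratically and the key-value cache grows with every token. This is why many models advertise 128K but slow to a crawl, or crash, before you reach it.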
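The 20 GB figure can be checked by hand: 100,000 × 100,000 scores at 2 bytes each. The sketch below does that arithmetic and also estimates the key-value cache, which grows linearly with context length. The model shape passed to `kv_cache_gb` (80 layers, 8 key-value heads, head dimension 128) is an assumption chosen to resemble a 70B-class model with grouped-query attention, not a published spec for any particular release.

```python
def attention_scores_gb(n_tokens: int, bytes_per_value: int = 2) -> float:
    """Size of one fully materialized n x n score matrix (one head, one layer)."""
    return n_tokens**2 * bytes_per_value / 1e9


def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Keys and values cached for every layer; grows linearly in n_tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value / 1e9


print(attention_scores_gb(100_000))  # 20.0
# Assumed shape, roughly a 70B-class model with grouped-query attention.
print(kv_cache_gb(100_000, n_layers=80, n_kv_heads=8, head_dim=128))  # 32.768
```

Doubling the context doubles the cache but quadruples the score matrix, which is the practical meaning of the O(n²) warning.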

Practical advice: Don't chase context length without checking whether the model actually maintains coherence at that length, especially for languages other than English (more on that below).

Part 3: Agentic Scores — How Independent Is the Model?

A model's agentic score measures how well it can act autonomously: plan, use tools, recover from errors, remember goals across steps.

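Autonomy tends to grow with parameter count, though not linearly, and training quality matters as much as raw size. The bands below are rough, illustrative ranges rather than results from a standardized benchmark: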

Size Class	Parameters	Agentic Score (0-100)	What it can do autonomously
Tiny	<1B	5-15	Single keyword extraction
Small	1-7B	20-35	1-2 step instructions
Medium	7-30B	40-60	Multi-step planning, tool use
Large	30-100B	55-75	Long-horizon, self-correction
Frontier	100B-2T	70-95	Sub-agent spawning, novel strategy

Real example: A small model (7B) needs explicit step‑by‑step prompts. A large model (70B) can handle "Plan my week's runs around the weather, and if it rains, reschedule intelligently" without hand‑holding.

Choose your model based on how much autonomy you actually need. For a simple FAQ bot, small is fine. For an AI research assistant that writes papers, you want large or frontier.
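One way to apply the table is to pick the smallest size class whose score range reaches the autonomy you need. The sketch below copies the ranges from the table above; the `smallest_class_for` helper and the idea of using the upper bound of each range as the cutoff are assumptions for illustration, not an established selection method.

```python
# (size class, parameter range, low score, high score), copied from the table above.
SIZE_CLASSES = [
    ("Tiny", "<1B", 5, 15),
    ("Small", "1-7B", 20, 35),
    ("Medium", "7-30B", 40, 60),
    ("Large", "30-100B", 55, 75),
    ("Frontier", "100B-2T", 70, 95),
]


def smallest_class_for(required_score: int) -> str:
    """Return the first (smallest) class whose range reaches required_score."""
    for name, _params, _low, high in SIZE_CLASSES:
        if high >= required_score:
            return name
    raise ValueError(f"No size class reaches a score of {required_score}")


print(smallest_class_for(30))  # Small
print(smallest_class_for(70))  # Large
```

Because the ranges overlap, a score near the top of a class is a best case; test your own task on the candidate model before committing.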

Part 4: 128K Context and the Language Problem

A 128K context window means nothing if the model loses its grasp of your language halfway through.

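Long-context evaluations suggest that many multilingual models suffer from language decay: accuracy falls as the relevant text sits deeper in the context, and it falls fastest for low-resource languages. The figures below illustrate the pattern rather than report a specific benchmark: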

Position in 128K	English Accuracy	Low‑Resource Language Accuracy
Start (0-10K)	95%	85%
Middle (40-60K)	92%	65% ↓
End (110-128K)	85%	25% ↓↓↓

What this means for you: If you're processing a 100‑page legal contract in Spanish, or a novel in Hindi, or a technical manual in Japanese, a 128K window only helps if the model was actually trained to handle that language at that distance.

Ask before you adopt: What is the model's coverage score for my language? How does it degrade over distance? Does the provider publish multilingual benchmarks?
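If the provider doesn't publish these numbers, you can measure degradation yourself by placing test questions at known depths in a long document and grouping the results by position. A minimal sketch of the grouping step follows; the `accuracy_by_position` helper is an assumption, and the sample results are hypothetical, not measurements from any model.

```python
from collections import defaultdict


def accuracy_by_position(results, bucket_size=10_000):
    """Group (token_position, is_correct) pairs into buckets.

    Returns {bucket_start: accuracy}, ordered by position.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for position, is_correct in results:
        bucket = position // bucket_size * bucket_size
        totals[bucket] += 1
        correct[bucket] += int(is_correct)
    return {b: correct[b] / totals[b] for b in sorted(totals)}


# Hypothetical outcomes from probing one model in one language.
sample = [
    (500, True), (9_000, True),
    (52_000, True), (55_000, False),
    (121_000, False), (125_000, False),
]
print(accuracy_by_position(sample))  # {0: 1.0, 50000: 0.5, 120000: 0.0}
```

Run the same probe in English and in your target language; the gap between the two curves is what matters.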

Part 5: Open‑Weight — What You Actually Get

Open‑weight means the trained parameter files are released for you to download, run, fine‑tune, and deploy.

It is not open‑source. You rarely get the training code, almost never get the training data, and cannot reproduce the model from scratch.

What you can do with open‑weight models:

Run locally (privacy, no API fees)
Fine‑tune on your data (medical, legal, customer support)
Quantize to fit on cheaper hardware
Deploy commercially without per‑token costs (if the model's license allows it; some open‑weight licenses restrict commercial use)
Audit for safety and bias

What you cannot do:

Reproduce the model from scratch
Know exactly what was in the training data (copyrighted books? private conversations?)
Fix flaws that would require retraining from scratch (no training code or data)

Hardware trade‑off:

Model Size	Memory (16‑bit)	Minimum Hardware
7B	14 GB	RTX 4090 ($1600)
13B	26 GB	2× consumer GPUs or cloud instance
70B	140 GB	4× A100 ($40k) or quantize aggressively

Open‑weight doesn't mean free — it means you pay for hardware instead of paying per API call.
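The trade-off can be framed as a break-even point: how many tokens you must process before owned hardware costs less than an API. A minimal sketch, assuming a flat per-token API price and ignoring electricity, hosting, and engineering time; the $1.00 per million tokens price is hypothetical.

```python
def break_even_tokens(hardware_usd: float, api_usd_per_million_tokens: float) -> float:
    """Tokens to process before owned hardware beats per-token API pricing.

    Ignores electricity, hosting, and engineering time.
    """
    return hardware_usd / api_usd_per_million_tokens * 1_000_000


# $1,600 GPU from the table above; the API price is a hypothetical placeholder.
print(f"{break_even_tokens(1600, 1.00):,.0f} tokens")  # 1,600,000,000 tokens
```

Plug in your provider's actual price and your realistic monthly volume before deciding; low-volume workloads rarely reach break-even.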

Part 6: Putting It All Together — A Decision Framework

When you evaluate a model, ask these questions in order:

Tokens: How many do I need per query? (cost, latency)
Parameters: Does the model size match my task complexity? (Don't use 70B for sentiment analysis.)
Hardware: Can I actually run this? (Memory + compute)
Agentic score: How much autonomy do I need? (Multi‑step tool use, or just Q&A?)
Context + language: Does the model maintain accuracy for my language across the entire context window?
Open‑weight vs API: Do I need privacy, fine‑tuning, and predictable costs? Or do I want convenience and zero hardware management?

Example decisions:

Simple chatbot for an English website: Small open‑weight model (7B) on a single GPU. Fine‑tune on your support tickets. Total cost: $2000 hardware + 10 hours.
Multilingual legal review (English + Spanish + French, 80‑page contracts): API model with proven 128K+ multilingual coverage (GPT‑4 Turbo or Gemini 1.5 Pro). Pay per document. No hardware headaches.
Research assistant that plans, executes, and reports: Large open‑weight model (70B) on cloud GPUs. Fine‑tune for your domain. Agentic score 70+.
Mobile app with privacy requirements: Tiny open‑weight model (<1B) running on device. Low agentic score, but zero data leaving the phone.

The Bottom Line

You don't need frontier models for most tasks. You don't need 128K context for most documents. And you definitely don't need to run 70B parameters locally if an API call costs less than your time debugging GPU drivers.

But when you do need control, privacy, or fine‑tuning — open‑weight models are the only real answer. Just be honest about the hardware they demand and the language coverage they actually deliver.

And always remember: tokens are what you ask. Parameters are what knows. Hardware is what pays. Choose accordingly.

This post was synthesized from a technical Q&A session. If you want to go deeper on any of these topics — quantization, agentic evaluation, or multilingual benchmarks — the conversation is open.

Known Public Domain - Bytes
