What You Need to Know About LLMs

 

From Tokens to Open Weight: What You Actually Need to Know About LLMs

Or: How I learned to stop worrying about the marketing and understand the machine

You've heard the terms. Parameters. Tokens. Context windows. Agentic scores. Open-weight. They get thrown around in every AI headline, but few people stop to explain what they actually mean — and more importantly, how they work together.

This post distills a recent deep-dive session into a clear, practical guide. No fluff. Just the mental models you need to make real decisions about which models to use, what hardware you need, and what "open" actually means.


Part 1: The Two Things Everything Else Depends On

Tokens: The Currency of Language

A token is the unit in which a model sees text. It isn't always a whole word: common words are often a single token, while longer or rarer words get split into pieces, and spaces and punctuation marks count too.

Example sentence: "The cat is sitting on the mat?"

That's 8–9 tokens, depending on the tokenizer:

  • "The" (1)
  • " cat" (1)
  • " is" (1)
  • " sitting" (1) — or " sit" + "ting" (2)
  • " on" (1)
  • " the" (1)
  • " mat" (1)
  • "?" (1)

Tokens are what you pay for with API calls. They're what context windows measure. They're the raw material.
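To make the counting concrete, here is a minimal sketch of a toy splitter and a cost estimate. It is not a real tokenizer: production tokenizers learn their splits from data, so actual counts will differ. The function names and the per-million-token price are assumptions for illustration only.

```python
import re

# Toy splitter: a word keeps its leading space, punctuation stands alone.
# Real tokenizers (BPE, SentencePiece, ...) learn their splits from data,
# so this only illustrates the idea and will not match their counts exactly.
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def rough_tokens(text: str) -> list[str]:
    """Split text into rough, token-like pieces."""
    return TOKEN_PATTERN.findall(text)


def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a call, given a (hypothetical) price per million tokens."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


tokens = rough_tokens("Tokens are what you pay for.")
print(tokens)       # ['Tokens', ' are', ' what', ' you', ' pay', ' for', '.']
print(len(tokens))  # 7
```

For real numbers, count with the tokenizer that ships with the model you plan to use, and plug in your provider's published price.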

Parameters: The Memory That Doesn't Forget

A parameter is a learned number inside the neural network: a weight or bias that determines how input (tokens) transforms into output (answers).

GPT-3: 175 billion parameters.
LLaMA 70B: 70 billion parameters.
Your fine-tuned model: whatever you can fit on your GPU.

Crucially: Parameters are fixed after training. They don't change when you prompt the model. They are the model's permanent knowledge. Tokens are the temporary question.
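A quick way to get a feel for "weights and biases" is to count them in a tiny fully connected network. This is a simplified sketch: real LLMs are transformers with attention and embedding layers, so the layer sizes and the function name here are illustrative assumptions, not how any particular model is built.

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Count weights and biases in a plain fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output unit
    return total


# 4 inputs -> 8 hidden units -> 2 outputs
print(count_parameters([4, 8, 2]))  # 4*8 + 8 + 8*2 + 2 = 58
```

Every one of those 58 numbers is set during training and then left alone at inference time. Scale the same idea up by nine or ten orders of magnitude and you have the parameter counts quoted above.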

| | Tokens | Parameters |
|---|---|---|
| What is it? | Pieces of text | Learned numbers |
| Changes per prompt? | Yes | No |
| How many in a typical query? | Hundreds to thousands | Billions (fixed) |


Part 2: The Hardware Reality Check

Here's where theory meets physics. Parameters don't float in the cloud — they live on GPUs, and GPUs have limits.

The Memory Wall

A 70-billion-parameter model stored in 16-bit precision needs 140 GB of GPU memory just for the parameters (70 billion × 2 bytes). Add activations and key-value caches and, depending on context length and batch size, you can easily reach 200+ GB.

That's why you can't run LLaMA 70B at 16-bit on a single consumer GPU (most top out at 24 GB). You need:

  • Multiple enterprise GPUs (4× A100 at ~$40k total)
  • Or quantization (squeezing parameters into 4 bits = 35GB)
  • Or smaller models (7B parameters = 14GB, fits on an RTX 4090)
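The numbers in this section all come from one piece of arithmetic: parameters × bits per parameter ÷ 8. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only; the function names are illustrative, and real quantized formats add a little overhead for scaling factors.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_on_gpu(num_params: float, bits_per_param: int, gpu_gb: float) -> bool:
    """Weights-only check; leaves no room for KV cache or activations."""
    return weight_memory_gb(num_params, bits_per_param) <= gpu_gb


for label, params, bits in [
    ("70B @ 16-bit", 70e9, 16),
    ("70B @ 4-bit", 70e9, 4),
    ("7B @ 16-bit", 7e9, 16),
]:
    print(f"{label}: {weight_memory_gb(params, bits):.0f} GB")

print(fits_on_gpu(7e9, 16, gpu_gb=24))   # True
print(fits_on_gpu(70e9, 4, gpu_gb=24))   # False: 35 GB still exceeds 24 GB
```

Treat the result as a floor, not a budget. Anything that barely fits on paper will not leave room for the cache and activations at run time.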

The Context Length Trap

A 128K context window sounds amazing — 300+ pages in one pass. But that 128K isn't free.

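Naive attention scales as O(n²) in sequence length. At 100K tokens, a single 16-bit attention score matrix is 100,000 × 100,000 × 2 bytes = 20 GB, on top of your parameters. Memory-efficient kernels such as FlashAttention avoid storing that matrix in full, but compute still grows quadratically and the key-value cache keeps growing with every token. This is why many models advertise 128K but slow to a crawl, or crash, before you reach it.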
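Here is the back-of-the-envelope version of that cost, as a sketch. The first function sizes one naively materialized score matrix (one head, one layer). The second sizes the key-value cache, using layer and head counts that are assumptions modeled on a 70B-class model with grouped-query attention; substitute the values from your model's config.

```python
def attention_matrix_gb(seq_len: int, bytes_per_value: int = 2) -> float:
    """Memory for ONE seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value / 1e9


def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Memory for cached keys and values across all layers, one sequence."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * seq_len / 1e9


print(attention_matrix_gb(100_000))  # 20.0

# Assumed config: 80 layers, 8 key-value heads, head dimension 128, 16-bit.
print(kv_cache_gb(100_000, num_layers=80, num_kv_heads=8, head_dim=128))  # 32.768
```

The matrix term is quadratic and the cache term is linear, which is why doubling the context can more than double the pain if your inference stack materializes attention scores.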

Practical advice: Don't chase context length without checking whether the model actually maintains coherence at that length, especially for languages other than English (more on that below).


Part 3: Agentic Scores — How Independent Is the Model?

A model's agentic score measures how well it can act autonomously: plan, use tools, recover from errors, remember goals across steps.

This scales (roughly) with parameter count, but not linearly:

| Size Class | Parameters | Agentic Score (0-100) | What it can do autonomously |
|---|---|---|---|
| Tiny | <1B | 5-15 | Single keyword extraction |
| Small | 1-7B | 20-35 | 1-2 step instructions |
| Medium | 7-30B | 40-60 | Multi-step planning, tool use |
| Large | 30-100B | 55-75 | Long-horizon, self-correction |
| Frontier | 100B-2T | 70-95 | Sub-agent spawning, novel strategy |

Real example: A small model (7B) needs explicit step‑by‑step prompts. A large model (70B) can handle "Plan my week's runs around the weather, and if it rains, reschedule intelligently" without hand‑holding.

Choose your model based on how much autonomy you actually need. For a simple FAQ bot, small is fine. For an AI research assistant that writes papers, you want large or frontier.
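To show what "use tools and recover from errors" means mechanically, here is a minimal sketch of an agent loop. Everything in it is hypothetical: the plan is hard-coded where a real agent would ask the model to produce it, and the `weather` and `schedule` tools are stubs, with the first weather call failing on purpose.

```python
from typing import Callable


def run_agent(plan: list[tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              max_retries: int = 1) -> list[str]:
    """Run planned (tool, argument) steps in order, retrying failed steps."""
    log = []
    for tool_name, arg in plan:
        for _attempt in range(max_retries + 1):
            try:
                result = tools[tool_name](arg)
                log.append(f"{tool_name}({arg}) -> {result}")
                break  # step succeeded, move on to the next one
            except Exception as err:
                log.append(f"{tool_name}({arg}) failed: {err}")
        else:
            # Only reached when every attempt raised.
            log.append(f"gave up on {tool_name}({arg})")
    return log


calls = {"weather": 0}


def weather(day: str) -> str:
    """Stub tool: times out on the first call, then answers."""
    calls["weather"] += 1
    if calls["weather"] == 1:
        raise TimeoutError("weather service timed out")
    return "rain" if day == "Tuesday" else "clear"


def schedule(entry: str) -> str:
    """Stub tool: pretends to write a calendar entry."""
    return f"scheduled {entry}"


log = run_agent([("weather", "Tuesday"), ("schedule", "Wednesday run")],
                {"weather": weather, "schedule": schedule})
for line in log:
    print(line)
```

The loop itself is trivial. What separates the size classes is how much of the plan, the tool choice, and the recovery the model can produce on its own instead of having it scripted for it.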


Part 4: 128K Context and the Language Problem

A 128K context window means nothing if the model loses its grasp of your language halfway through.

Research shows that many multilingual models suffer from language decay over long contexts:

| Position in 128K | English Accuracy | Low‑Resource Language Accuracy |
|---|---|---|
| Start (0-10K) | 95% | 85% |
| Middle (40-60K) | 92% | 65% ↓ |
| End (110-128K) | 85% | 25% ↓↓↓ |

What this means for you: If you're processing a 100‑page legal contract in Spanish, or a novel in Hindi, or a technical manual in Japanese, a 128K window only helps if the model was actually trained to handle that language at that distance.

Ask before you adopt: What is the model's coverage score for my language? How does it degrade over distance? Does the provider publish multilingual benchmarks?
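If the provider doesn't publish those numbers, you can measure them yourself: plant questions at known positions in long documents in your language, record whether the model answers correctly, and bucket the results by position. A minimal sketch of the bucketing step follows; the function name is illustrative and the sample results are made up purely to show the output shape.

```python
from collections import defaultdict


def accuracy_by_position(results, bucket_size: int = 10_000) -> dict[int, float]:
    """results: iterable of (token_position, is_correct) pairs.

    Returns accuracy per bucket, keyed by the bucket's starting position.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, is_correct in results:
        bucket = position // bucket_size * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up results for illustration only, not measurements of any model.
sample = [
    (500, True), (2_000, True),
    (55_000, True), (58_000, False),
    (120_000, False), (125_000, False),
]
print(accuracy_by_position(sample))  # {0: 1.0, 50000: 0.5, 120000: 0.0}
```

Run it once per language you care about. A curve that stays flat for English and falls off for your language is exactly the decay described above.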


Part 5: Open‑Weight — What You Actually Get

Open‑weight means the trained parameter files are released for you to download, run, fine‑tune, and deploy.

It is not open‑source. You rarely get the training code, almost never get the training data, and cannot reproduce the model from scratch.

What you can do with open‑weight models:

  • Run locally (privacy, no API fees)
  • Fine‑tune on your data (medical, legal, customer support)
  • Quantize to fit on cheaper hardware
  • Deploy commercially without per‑token costs (subject to the model's license terms)
  • Audit for safety and bias

What you cannot do:

  • Reproduce the model from scratch
  • Know exactly what was in the training data (copyrighted books? private conversations?)
  • Fix fundamental architectural bugs (no training code)

Hardware trade‑off:

| Model Size | Memory (16‑bit) | Minimum Hardware |
|---|---|---|
| 7B | 14 GB | RTX 4090 ($1600) |
| 13B | 26 GB | 2× consumer GPUs or cloud instance |
| 70B | 140 GB | 4× A100 ($40k) or quantize aggressively |

Open‑weight doesn't mean free — it means you pay for hardware instead of paying per API call.
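One way to sanity-check that trade is a break-even estimate: how many tokens would you have to push through an API before the hardware pays for itself? A rough sketch, assuming a hypothetical API price of $1 per million tokens and ignoring electricity, hosting, and your own time, all of which push the real break-even point further out.

```python
def break_even_tokens(hardware_usd: float,
                      api_usd_per_million_tokens: float) -> float:
    """Tokens at which total API spend equals the up-front hardware cost."""
    return hardware_usd / api_usd_per_million_tokens * 1_000_000


# $1600 GPU vs. an assumed (hypothetical) API price of $1 per million tokens.
tokens = break_even_tokens(1600, 1.0)
print(f"{tokens:,.0f} tokens")  # 1,600,000,000 tokens
```

If your workload is nowhere near that volume, the API is probably cheaper, unless privacy or fine-tuning is the real reason you want the weights.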


Part 6: Putting It All Together — A Decision Framework

When you evaluate a model, ask these questions in order:

  1. Tokens: How many do I need per query? (cost, latency)
  2. Parameters: Does the model size match my task complexity? (Don't use 70B for sentiment analysis.)
  3. Hardware: Can I actually run this? (Memory + compute)
  4. Agentic score: How much autonomy do I need? (Multi‑step tool use, or just Q&A?)
  5. Context + language: Does the model maintain accuracy for my language across the entire context window?
  6. Open‑weight vs API: Do I need privacy, fine‑tuning, and predictable costs? Or do I want convenience and zero hardware management?

Example decisions:

  • Simple chatbot for an English website: Small open‑weight model (7B) on a single GPU. Fine‑tune on your support tickets. Total cost: $2000 hardware + 10 hours.
  • Multilingual legal review (English + Spanish + French, 80‑page contracts): API model with proven 128K+ multilingual coverage (GPT‑4 Turbo or Gemini 1.5 Pro). Pay per document. No hardware headaches.
  • Research assistant that plans, executes, and reports: Large open‑weight model (70B) on cloud GPUs. Fine‑tune for your domain. Agentic score 70+.
  • Mobile app with privacy requirements: Tiny open‑weight model (<1B) running on device. Low agentic score, but zero data leaving the phone.

The Bottom Line

You don't need frontier models for most tasks. You don't need 128K context for most documents. And you definitely don't need to run 70B parameters locally if an API call costs less than your time debugging GPU drivers.

But when you do need control, privacy, or fine‑tuning — open‑weight models are the only real answer. Just be honest about the hardware they demand and the language coverage they actually deliver.

And always remember: tokens are what you ask. Parameters are what knows. Hardware is what pays. Choose accordingly.


This post was synthesized from a technical Q&A session. If you want to go deeper on any of these topics — quantization, agentic evaluation, or multilingual benchmarks — the conversation is open.
