LLM Mastery for Enterprise AI Engineering Beginner ⏱ 30 min

Tokens and Tokenization

How tokenization affects cost, context windows, latency, multilingual behavior, and practical engineering decisions.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: How AI Models Work

LLM Mastery course page. This lesson is part 4 of 5 in the beginner track. Use the lab and assessment sections as the completion standard, not optional reading.

Required mastery artifact: by the end of this lesson, update the running enterprise readiness packet for a realistic use case. Treat examples and vendor names as dated illustrations; defend decisions with current model, cost, risk, and evaluation evidence.

03 — Tokens & Tokenization

Module 01 | Foundations

## What is a Token?

An LLM doesn’t read text the way you do. It doesn’t read character by character either.

It reads tokens.

A token is a chunk of text — usually a word, part of a word, or a punctuation mark.

Think of it like this: if text is a pizza, tokens are the slices. Sometimes a slice is a whole word, sometimes it’s just a syllable, sometimes it’s punctuation.

```
"Hello, world!"
→ ["Hello", ",", " world", "!"]
→ 4 tokens
```

```
"Tokenization is fascinating"
→ ["Token", "ization", " is", " fasci", "nating"]
→ 5 tokens
```

## Why Not Just Use Letters? Or Words?

Great question. Let’s think through it.

**Option 1: Character by character**

- "cat" → ['c', 'a', 't'] → 3 units
- Pro: Simple, small vocabulary
- Con: The model has to learn from scratch that "c-a-t" means cat. Sequences get very long, which makes long-range patterns hard to learn.

**Option 2: Word by word**

- "cats" and "cat" are related, but they would be separate vocabulary entries
- The model would need a separate entry for every word form: run, runs, running, ran, runner…
- English alone has hundreds of thousands of words (over a million by some counts), and any word missing from the vocabulary cannot be represented at all.

**Option 3: Tokens (subword units)** ✅

- "running" → ["run", "ning"]: two familiar pieces
- The model can combine familiar pieces to handle new words
- Vocabulary is manageable: roughly 30,000-250,000 tokens for most models
- Works across languages, because unfamiliar words fall back to smaller pieces

This is the sweet spot. Most modern LLMs use subword tokenization.

## How Tokenization Works: BPE

The most popular tokenization algorithm is called Byte Pair Encoding (BPE).

Here’s how it works conceptually:

1. Start with every character as its own token
2. Find the most common pair of adjacent tokens
3. Merge them into one new token
4. Repeat until you have your desired vocabulary size

Example:

```
Start: "l o w l o w e r l o w e s t"

Most common pair: "l o" (3 times, tied with "o w") → merge to "lo"
Now:    "lo w lo w e r lo w e s t"

Most common pair: "lo w" → merge to "low"
Now:    "low low e r low e s t"

And so on...
```

After tens of thousands of merges on real text (each merge adds one entry to the vocabulary), you end up with a vocabulary of common words and word-parts.
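
The merge loop described above can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: it treats the input as one flat list of symbols, as the worked example does, whereas production BPE trainers count pairs inside words across a large corpus. The function name `learn_bpe` is made up for this illustration.

```python
from collections import Counter


def learn_bpe(symbols, num_merges):
    """Greedy toy BPE: return (merges, final_symbols) after num_merges merges."""
    symbols = list(symbols)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Ties go to the pair seen first, so ("l", "o") wins over ("o", "w").
        (left, right), _count = pairs.most_common(1)[0]
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols


merges, symbols = learn_bpe("l o w l o w e r l o w e s t".split(), num_merges=2)
print(merges)             # [('l', 'o'), ('lo', 'w')]
print(" ".join(symbols))  # low low e r low e s t
```

Raising `num_merges` keeps growing the vocabulary, which is how the target vocabulary size acts as the stopping condition.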

---

## The Vocabulary

Each token gets assigned a unique **ID number**. For example, in GPT-2's vocabulary:

```
"Hello"    → 15496
" world"   → 995
"!"        → 0
" the"     → 262
" cat"     → 3797
```

When the model "reads" text, it converts everything to these numbers. When it "writes" text, it picks a number and converts it back.

This mapping is called the **vocabulary**. The **tokenizer** is the tool that applies it, converting text to token IDs and back.
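
As a rough illustration of that round trip, the sketch below uses a hypothetical five-entry vocabulary built from the sample IDs listed above. The entry for " world" is written with its leading space so the decoded text keeps its spacing. The `encode` and `decode` helpers are illustrative names, and the text is assumed to be already split into pieces.

```python
# Hypothetical five-entry vocabulary; real ones hold tens of thousands of entries.
vocab = {"Hello": 15496, " world": 995, "!": 0, " the": 262, " cat": 3797}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(pieces):
    """Map already-split text pieces to their token IDs."""
    return [vocab[piece] for piece in pieces]


def decode(ids):
    """Map token IDs back to text by concatenating the pieces."""
    return "".join(id_to_token[token_id] for token_id in ids)


ids = encode(["Hello", " world", "!"])
print(ids)          # [15496, 995, 0]
print(decode(ids))  # Hello world!
```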

---

## Practical Token Examples

Let's see how different text tokenizes with GPT-4's tokenizer (cl100k_base). Exact counts differ from one tokenizer to another:

```
"Hello"          → 1 token
"Hello!"         → 2 tokens (Hello, !)
"Hello world"    → 2 tokens
"Tokenization"   → 2 tokens (Token, ization)
"AI"             → 1 token
"artificial"     → 2 tokens (art, ificial)
"intelligence"   → 2 tokens (intel, ligence)
```

Interesting patterns:
- Common short words = 1 token
- Rare or long words = multiple tokens
- Spaces are often part of the token that follows them

---

## Why This Matters for You as an Engineer

### 1. Cost
APIs charge by token, not by word.
```
"Explain machine learning to a 5-year-old in detail."
= ~11 tokens
= at an illustrative $15 per 1M tokens: 11/1,000,000 × $15 ≈ $0.000165 (very cheap)

But if you send a 10-page PDF as text:
= ~700 tokens per page × 10 pages = ~7,000 tokens input
= 7,000/1,000,000 × $15 ≈ $0.105 (over 600 times more per request)
```
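
The arithmetic above can be wrapped in a small helper. This is a sketch that assumes a flat per-token price; the $15 per million tokens default is the illustrative rate from the example, not a current price for any provider, and the function name is hypothetical.

```python
def estimate_cost_usd(num_tokens, usd_per_million_tokens=15.0):
    """Estimate the cost of a request from its token count (flat-rate assumption)."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


print(f"${estimate_cost_usd(11):.6f}")     # short prompt: $0.000165
print(f"${estimate_cost_usd(7_000):.3f}")  # longer 7,000-token input: $0.105
```

Many providers price input and output tokens differently, so a real estimate would call this once per direction with the matching rate.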

### 2. Context limits

Every model has a maximum token limit, and the prompt and the generated output both count toward it. You can’t exceed it.

- GPT-4 Turbo: 128,000 tokens (~96,000 words)
- Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
- LLaMA 3 8B: 8,192 tokens (~6,000 words)
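
A pre-flight check against these limits might look like the sketch below. It assumes the prompt and the reserved output share one window, and it hard-codes the limits listed above, which are dated illustrations. `fits_in_context` is a hypothetical helper name.

```python
# Limits copied from the list above; verify against current model documentation.
CONTEXT_LIMITS = {
    "GPT-4 Turbo": 128_000,
    "Claude 3.5 Sonnet": 200_000,
    "LLaMA 3 8B": 8_192,
}


def fits_in_context(model, prompt_tokens, max_output_tokens):
    """Return True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_LIMITS[model]


print(fits_in_context("LLaMA 3 8B", prompt_tokens=8_000, max_output_tokens=500))   # False
print(fits_in_context("GPT-4 Turbo", prompt_tokens=8_000, max_output_tokens=500))  # True
```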

### 3. Counting tokens is not counting words

- "The cat sat" = 3 words = 3 tokens (short, common words often map one-to-one, but not always)
- "supercalifragilistic" = 1 word = 5+ tokens

### 4. Languages tokenize differently

Most tokenizers are trained on data dominated by English, so English tokenizes efficiently. Many other languages don’t:

- English: "Hello, how are you?" → ~6 tokens
- Japanese: "こんにちは、元気ですか？" → ~10-15 tokens

This means:
- APIs are more expensive for non-English text
- Non-English text uses up the context window faster
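
One contributing factor can be seen without any tokenizer at all. Byte-level BPE tokenizers start from UTF-8 bytes, and characters outside ASCII take several bytes each, so they need more learned merges before they collapse into single tokens. The sketch below only measures characters and bytes; it is not a token count, and `utf8_stats` is an illustrative name.

```python
def utf8_stats(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


print(utf8_stats("Hello, how are you?"))     # (19, 19)
print(utf8_stats("こんにちは、元気ですか？"))  # (12, 36)
```

The Japanese sentence has fewer characters but almost twice the bytes, which is the raw material the tokenizer has to compress.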

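### 5. Numbers tokenize strangely

With cl100k_base, which splits runs of digits into groups of up to three:

```
"1234" → 2 tokens ("123", "4")
"1234567" → 3 tokens ("123", "456", "7")
"3.14159265" → 5 tokens ("3", ".", "141", "592", "65")
```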

This is one reason LLMs struggle with arithmetic. They see numbers as token chunks, not as mathematical values.
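
To make the chunking concrete, the sketch below imitates one behaviour of cl100k_base-style pre-tokenization: grouping digits left to right in runs of at most three. It is an approximation for illustration only, not the real tokenizer, and `digit_chunks` is a hypothetical helper.

```python
import re


def digit_chunks(number_text):
    """Split digit runs into left-to-right groups of at most three digits."""
    return re.findall(r"\d{1,3}|[^\d]", number_text)


print(digit_chunks("1234567"))     # ['123', '456', '7']
print(digit_chunks("234567"))      # ['234', '567']
print(digit_chunks("3.14159265"))  # ['3', '.', '141', '592', '65']
```

Dropping the leading digit changes every chunk boundary, so the same digits land in different tokens. The chunks do not line up with place value, which is part of what makes digit-by-digit reasoning hard for the model.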

---

## Common Tokenizers

| Model Family | Tokenizer | Vocabulary Size |
|-------------|-----------|----------------|
| GPT-3.5/4 | tiktoken (cl100k_base) | ~100,000 |
| LLaMA 1/2 | SentencePiece | ~32,000 |
| LLaMA 3 | tiktoken variant | ~128,000 |
| Claude | Anthropic custom | ~100,000+ (estimate; not officially published) |
| Mistral | SentencePiece | ~32,000 |

A bigger vocabulary means more words map to a single token, so sequences are shorter and cheaper to process. The trade-off is memory: the embedding and output layers grow with vocabulary size.
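
The memory side of that trade-off can be estimated with simple arithmetic. The sketch below assumes a hidden size of 4,096 and 16-bit weights, which are typical of 7B-8B class models but are assumptions here. It counts only the input embedding table.

```python
def embedding_table_bytes(vocab_size, hidden_size=4096, bytes_per_value=2):
    """Size of a vocab_size x hidden_size embedding table in bytes."""
    return vocab_size * hidden_size * bytes_per_value


for vocab_size in (32_000, 128_000):
    megabytes = embedding_table_bytes(vocab_size) / 1e6
    print(f"{vocab_size:,} tokens: {megabytes:.0f} MB")
# 32,000 tokens: 262 MB
# 128,000 tokens: 1049 MB
```

If the output projection does not share weights with the embedding table, the vocabulary-dependent cost is roughly double these figures.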

---

## Counting Tokens in Code

```python
# Using tiktoken (for OpenAI-style models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Hello! How does tokenization work?"
tokens = enc.encode(text)

print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# Output depends on the encoding, so run the script to see the exact IDs.
# With cl100k_base, "Hello" is ID 9906 (15496 is its ID in the older GPT-2 vocabulary),
# and " work" and "?" come out as separate tokens.
```

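```python
# Using Hugging Face tokenizer
from transformers import AutoTokenizer

# Note: this repository is gated; request access on Hugging Face and log in first.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "Hello, how does tokenization work?"
tokens = tokenizer.tokenize(text)  # token strings, without special tokens
ids = tokenizer.encode(text)       # token IDs, including any special tokens the tokenizer adds

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Count: {len(ids)}")  # can exceed len(tokens) because of added special tokens
```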

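## Special Tokens

Models use special tokens for structure. You’ll see these everywhere:

| Token | Meaning |
|-------|---------|
| `<\|endoftext\|>` | End of text |
| `<s>` | Start of sequence |
| `</s>` | End of sequence |
| `[INST]` | Start of user instruction (LLaMA) |
| `[/INST]` | End of user instruction |
| `<\|im_start\|>` | Start of message |
| `<\|im_end\|>` | End of message |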

These are how models know who is speaking — the user, the assistant, or the system.
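
As a rough sketch of how such markers frame a conversation, the function below wraps each message in `<|im_start|>` and `<|im_end|>` in the ChatML style. The exact template differs between models, so treat the layout as an assumption and `to_chatml` as a hypothetical helper; in practice the model's own chat template should be used.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    # Leave an open assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
])
print(prompt)
```

Every marker and role label is itself tokenized, so chat formatting adds a small fixed overhead to each message.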

## Token Budget: A Practical Rule of Thumb

For rough estimates:

- 1 token ≈ 0.75 words (English)
- 1 token ≈ 4 characters (English)
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
- 100,000 tokens ≈ 75,000 words ≈ a full novel
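
These ratios translate directly into a quick estimator. The sketch below applies both rules of thumb and returns the two estimates side by side. It only holds for English prose, and `estimate_tokens` is an illustrative name rather than a library function.

```python
def estimate_tokens(text):
    """Return (estimate from characters, estimate from words) for English text."""
    by_characters = round(len(text) / 4)         # 1 token ≈ 4 characters
    by_words = round(len(text.split()) / 0.75)   # 1 token ≈ 0.75 words
    return by_characters, by_words


print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # (11, 12)
```

When the two estimates disagree sharply, the text is probably not typical English prose (code, URLs, or another language), and a real tokenizer count is worth the extra call.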

## 📝 Summary

| Concept | Plain English |
|---------|---------------|
| Token | A chunk of text (word, part-word, or punctuation) the model processes |
| Tokenizer | The tool that converts text ↔ token IDs |
| BPE | Algorithm that learns token boundaries from data |
| Vocabulary | The full list of all possible tokens the model knows |
| Context window | Maximum number of tokens a model can process at once |
| Special tokens | Structural tokens like “start of message”, “end of text” |

## 🧠 Mental Model

Tokens are like Lego blocks of text. Words are broken into reusable blocks that the model can snap together and understand. Some words are one block, some are many blocks. The model speaks Lego, not English.

## ❌ Beginner Mistakes to Avoid

- “Token count = word count”: Off by roughly 25-40% in English, and by more in other languages. Always use a tokenizer to count precisely.
- “LLMs can’t handle long documents”: They can, within their context window. Split larger docs into chunks.
- “All languages cost the same”: Non-English text often uses significantly more tokens per concept.
- “The model reads character by character”: No. It reads whole token chunks at once.
- “I can save money by removing spaces”: Spaces are usually part of tokens. Removing them changes tokenization unpredictably.
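
For the long-document case, a minimal chunking sketch is shown below. It assumes English text and the rough ratio of 0.75 words per token, and it splits on whitespace, so original line breaks are not preserved. `chunk_by_token_budget` is a hypothetical helper; a production version would count with the model's real tokenizer.

```python
def chunk_by_token_budget(text, max_tokens, words_per_token=0.75):
    """Split text into chunks whose estimated token count stays within max_tokens."""
    max_words = max(1, int(max_tokens * words_per_token))
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


document = " ".join(f"word{i}" for i in range(10))
chunks = chunk_by_token_budget(document, max_tokens=4)
print(len(chunks))  # 4
print(chunks[0])    # word0 word1 word2
```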

## 🏋️ Exercise

**Task:** Explore tokenization hands-on.

**Part 1: Use a visual tokenizer**

- Visit: https://platform.openai.com/tokenizer
- Or: https://huggingface.co/spaces/Xenova/the-tokenizer-playground

Try tokenizing:

- Your full name
- A paragraph in English
- The same paragraph in another language (use Google Translate)
- A URL
- Some Python code
- The number 3.14159265358979

**Part 2: Count tokens programmatically**

```
pip install tiktoken
```

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

texts = [
    "Hello world",
    "Supercalifragilistic",
    "こんにちは世界",  # Japanese: "Hello world"
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "3.14159265358979323846"
]

for text in texts:
    count = len(enc.encode(text))
    # Shorten long strings for display; add "..." only when something was cut
    label = text if len(text) <= 30 else text[:30] + "..."
    print(f"'{label}' → {count} tokens")
```

**Think about:** Why does Japanese use more tokens? What does that mean for API costs?

---

*Next: 04 — Context Windows*