0.1 — What is an LLM?
A Large Language Model (LLM) is a neural network trained on massive amounts of text to perform a fundamental task: predict the next token. When you type "The capital of France is", the model computes a probability distribution and determines that "Paris" is the most likely token. There is no internal database, no hidden search engine, just large-scale statistical prediction.
The autocomplete analogy
Your IDE's autocomplete analyzes the local context (imports, types, variables in scope) to suggest the next token. An LLM does the same thing, but its "scope" covers the entire knowledge absorbed during training: documentation, open source code, books, articles. Same principle, radically different scale.
Training in 3 phases
Training breaks down into distinct stages:
1. Pre-training: the model ingests terabytes of text (code, Wikipedia, books, forums) to learn the statistical patterns of language. This phase costs tens of millions of dollars in GPU compute.
2. Fine-tuning: the model is refined on more specific, higher-quality data.
3. RLHF (Reinforcement Learning from Human Feedback): humans evaluate the model's responses, and it learns to produce useful and relevant results. This phase turns a raw text predictor into a conversational assistant.
The core technical mechanism is attention (Transformer architecture, Google, 2017). It lets the model "look at" every part of the input text simultaneously. If you write a 200-line function and the return type is defined at line 3, attention lets the model connect that information at line 200.
Key takeaways
• An LLM predicts the next token based on all previous tokens
• Training happens in 3 phases: pre-training, fine-tuning, RLHF
• The model does not "understand", it produces statistically likely text
• It's autocomplete at the scale of the internet, not a reasoning engine
0.2 — Tokens: the basic unit
A token is not a word. It's a fragment of text, sometimes a whole word (function), sometimes a piece of a word (des + er + ial + ization), sometimes a single character (a brace, a newline). The splitting process is called tokenization, performed by an algorithm (tokenizer) specific to each model family. Confusing tokens and words is a mistake that costs money, literally.
Why tokens matter for a developer
Three direct reasons:
- Cost: you pay by usage in tokens, not words or characters. Every token sent (input) and generated (output) is billed.
- Context window: if your model has a 200,000-token window, that's the limit of what it can "see" simultaneously, not 200,000 words.
- Code is token-hungry: braces, parentheses, indentation, long variable names. A 100-line code file easily represents 500 to 800 tokens.
Counting your tokens
Rough rule of thumb: 1 token ≈ 4 characters in English ≈ 3/4 of a word. French consumes 20-30% more tokens than English.
| Example | Tokens |
|---|---|
| `function` | 1 token |
| `désérialisation` | 5 tokens |
| 100 lines of code | ~500-800 tokens |
OpenAI's tokenizer is available online (platform.openai.com/tokenizer). Getting into the habit of checking the token count of your prompts and context files is the first step toward efficient LLM usage.
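Before reaching for an online tokenizer, you can get a ballpark figure from the rule of thumb above. Here is a minimal sketch of a heuristic estimator; it is not a real tokenizer, so treat its output as an approximation only:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token rule of thumb.

    This is a heuristic, not a real tokenizer: actual counts depend on the
    model's tokenizer and can differ significantly, especially for code
    and for French (which runs ~20-30% higher than English).
    """
    return max(1, round(len(text) / 4))

code = "function deserializeUser(payload) {\n    return JSON.parse(payload);\n}\n"
print(estimate_tokens(code))  # rough estimate for a small code snippet
```

For billing-accurate counts, use the tokenizer that matches your model family; the heuristic is only good for order-of-magnitude budgeting.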
Key takeaways
• A token is a fragment of text, not a word
• The tokenizer is specific to each model (Claude, GPT-4, etc.)
• Code consumes more tokens than natural text
• French consumes 20-30% more tokens than English
• Everything is billed in tokens: input + output
0.3 — The context window
The context window is the maximum number of tokens an LLM can process in a single interaction. It's its working memory. Everything the model "knows" about your conversation, your code, and your instructions must fit inside this window. If you exceed the limit, the oldest information is truncated or ignored.
What fills the window
The window does not contain only your prompt. It includes everything:
| Element | Description |
|---|---|
| System prompt | The model's base instructions |
| History | All previous questions and answers |
| Injected context | Files, documentation, search results |
| Response in progress | Tokens generated by the model |
When you use Claude Code and it reads 15 files from your project, those 15 files take up space. If your conversation has lasted 30 exchanges, all that history is there too. For a typical project with its configuration files, modules, and tests, a few dozen files are enough to saturate the window.
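To make this concrete, here is a sketch of a token-budget tally over the components in the table above. The 200K limit and the 4-characters/token heuristic are illustrative assumptions, not exact accounting:

```python
WINDOW_LIMIT = 200_000  # e.g. a 200K-token window

def estimate_tokens(text: str) -> int:
    # Heuristic: ~4 characters per token; real counts vary by tokenizer.
    return max(1, len(text) // 4)

def window_usage(system_prompt: str, history: list[str], files: list[str]) -> dict:
    """Tally how much of the window each component consumes."""
    usage = {
        "system_prompt": estimate_tokens(system_prompt),
        "history": sum(estimate_tokens(turn) for turn in history),
        "injected_files": sum(estimate_tokens(f) for f in files),
    }
    usage["total"] = sum(usage.values())
    usage["remaining"] = WINDOW_LIMIT - usage["total"]
    return usage

usage = window_usage(
    system_prompt="You are a coding assistant.",
    history=["How do I parse JSON?"] * 30,  # 30 exchanges add up
    files=["x" * 3000] * 15,                # 15 injected files dominate
)
print(usage)
```

Even in this toy example, the injected files dwarf the prompt itself, which is why file selection matters more than prompt length.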
The "lost in the middle" phenomenon
LLMs pay more attention to information at the beginning and end of the context, and tend to "forget" what sits in the middle. If you paste 50 files and the critical information is in the 25th, the model is likely to miss it. The selection and ordering of context are just as important as the content of the prompt.
Common window sizes (2026):
- Claude Opus 4 / Sonnet 4: 200K tokens (standard), 1M tokens (extended)
- GPT-4 Turbo: 128K tokens
- Gemini: up to 1M tokens
Key takeaways
• The context window = the LLM's RAM, measured in tokens
• It includes everything: system prompt, history, injected files, response in progress
• More context does not mean a better answer, quality beats quantity
• Long conversations degrade quality (history saturates the window)
0.4 — Prompt vs Context
Most developers disappointed by AI make the same mistake: they focus on the prompt (the question asked) and neglect the context (all the information that accompanies that question). In practice, context is responsible for the majority of answer quality. A simple prompt in a rich context beats a sophisticated prompt in a poor context.
The context layers
An LLM's context works in layers, from the deepest to the most superficial:
1. System prompt: base instructions that define the model's behavior
2. Persistent context: configuration files like CLAUDE.md, project rules, conventions
3. Session context: files read, search results, conversation history
4. User prompt: your question or instruction
Each layer influences the response, but the deeper layers have more impact than the surface prompt. A well-written CLAUDE.md file improves all your future interactions, not just the next one.
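The four layers above can be sketched as the assembly of a chat-API message list. The structure mirrors common chat APIs, but the field names and the merge of layers 1 and 2 into the system message are illustrative choices, not a specific provider's schema:

```python
def build_context(system_prompt: str, persistent: str,
                  session: list[dict], user_prompt: str) -> list[dict]:
    """Assemble the four context layers, deepest first.

    The deeper layers (system prompt, persistent context) shape every
    response; the user prompt is only the topmost layer.
    """
    messages = [
        # Layers 1+2: system prompt plus persistent project context (e.g. CLAUDE.md)
        {"role": "system", "content": system_prompt + "\n\n" + persistent},
    ]
    # Layer 3: session context (prior turns, file contents, search results)
    messages.extend(session)
    # Layer 4: the actual question
    messages.append({"role": "user", "content": user_prompt})
    return messages

ctx = build_context(
    system_prompt="You are a senior TypeScript reviewer.",
    persistent="Project conventions: strict mode, no default exports.",
    session=[{"role": "user", "content": "Here is src/api.ts: ..."}],
    user_prompt="Refactor the error handling in this file.",
)
print([m["role"] for m in ctx])  # ['system', 'user', 'user']
```

Note that the persistent context rides along on every call, which is exactly why investing in it pays off across all future interactions.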
The consultant analogy
The prompt is the question you ask a consultant. The context is the brief sent before the meeting: project documentation, existing code, technical constraints, history of decisions. Without a brief, you get a generic answer. With a complete brief, the same question produces a specific and actionable answer.
Key takeaways
• The prompt = your question. The context = everything else
• Context has more impact on quality than the prompt itself
• Investing in persistent context (CLAUDE.md, conventions) compounds
• The real skill is not prompt engineering, it's context engineering
0.5 — Temperature, top-p and generation settings
When an LLM generates the next token, it computes a probability distribution over its entire vocabulary. Generation settings control how the model picks among these candidates. Most developers use default values without understanding what they are leaving on the table.
Temperature
Temperature controls the level of randomness in token selection:
| Temperature | Behavior | Recommended use |
|---|---|---|
| 0 | Deterministic, always the most likely token | Code, tests, refactoring |
| 0.3-0.5 | Slight variability, good balance | Exploratory refactoring |
| 0.7-1.0 | Creative, more diversity | Brainstorming, writing |
| > 1.0 | Erratic, unlikely tokens picked | Rarely useful |
Other settings
- top-p (nucleus sampling): truncates the candidate set. A top-p of 0.9 considers only the most likely tokens whose cumulative probability reaches 90%. It removes unlikely tokens while preserving diversity.
- max_tokens: maximum length of the response. Too low = a response cut off in the middle of a code block.
- top-k: limits the choice to the k most likely tokens.
- frequency_penalty: penalizes tokens already used, to avoid repetition.
In practice, with Claude Code in agentic mode, these settings are managed automatically. But when you use the API directly or configure a tool like Cursor, understanding these settings makes the difference.
Key takeaways
• Temperature is the creativity/reliability dial, for code, keep it low
• top-p and temperature are complementary, you rarely tune both at once
• max_tokens too low = truncated response, too high = unnecessary cost
• Claude Code optimizes these settings automatically in agentic mode
0.6 — The limits of LLMs
Understanding the limits of a tool is what separates the developer who uses it effectively from the one who is constantly disappointed. An LLM is not an oracle. It's a tool with predictable blind spots, and knowing them lets you work around them.
Hallucinations
An LLM can generate false information with absolute confidence: invent a nonexistent API, cite fictional parameters of a real library, produce code that uses imaginary functions. The model does not "verify" anything, it generates statistically likely text. The risk increases on topics sparsely represented in training data: recent libraries, internal APIs, proprietary code.
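Part of this risk can be caught mechanically. Here is a sketch of a sanity check that verifies a function an LLM claims exists is actually exposed by the target module, before you trust generated code that calls it (`fast_dumps` below is a deliberately invented name):

```python
import importlib

def api_exists(module_name: str, attr: str) -> bool:
    """Check that a module really exposes the attribute an LLM claims it has."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)

# A real function vs a plausible-sounding invention:
print(api_exists("json", "dumps"))       # True
print(api_exists("json", "fast_dumps"))  # False: hallucinated name
```

This only verifies existence, not behavior; running the generated code against real tests remains the stronger check.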
Limited reasoning
LLMs excel at pattern matching but struggle with complex multi-step reasoning. Tracing the execution of a recursive algorithm across 15 levels of depth exceeds their statistical prediction capabilities. It's not formal reasoning, it's pattern recognition.
Other structural limits
- No persistent memory: every conversation starts from scratch (except for persistence mechanisms like CLAUDE.md)
- Cutoff date: the model doesn't know about events after its training
- Data bias: over-representation of certain languages (Python, JavaScript), frameworks (React, Spring), and architectural approaches
- No execution: it cannot test, compile, or verify its own output. That's why agentic tools like Claude Code are powerful: they fill this gap
Key takeaways
• Hallucinations are inherent, always verify APIs and function signatures
• Multi-step reasoning is fragile, break down complex problems
• No memory between sessions without explicit configuration
• Agentic tools compensate for the lack of execution (testing, compiling, iterating)
0.7 — The AI tooling ecosystem for devs
The market for AI developer tools evolves every week. GitHub Copilot, Cursor, Claude Code, Windsurf, Codeium, Amazon Q Developer, the profusion is real. Understanding the fundamental differences between these tools helps you pick the one that fits your workflow.
Three integration types
| Type | Examples | Characteristic |
|---|---|---|
| IDE | Copilot, Cursor, Windsurf | Integrates into your editor |
| CLI | Claude Code, Gemini CLI, Codex CLI | Runs from the terminal |
| Web | ChatGPT, Claude.ai | Chat interface |
Three interaction paradigms
1. Inline completion (Copilot, Tabnine, Codeium): predicts the continuation of your code as you type. Advanced autocomplete, at line or block level.
2. Integrated chat (Copilot Chat, Cursor Chat): questions and modification requests in a chat window next to the code.
3. Agentic (Claude Code, Cursor Composer, Windsurf Cascade): autonomous action. The agent reads files, runs commands, modifies code, runs tests, and iterates.
Underlying model vs tool
A commonly confused point: the tool and the model are two separate choices. Cursor can use GPT-4 or Claude. Copilot uses OpenAI models. Claude Code exclusively uses Claude models. The quality of the integration, what context the tool sends, how it handles errors, how it orchestrates actions, often matters more than the raw model.
Key takeaways
• Three paradigms: inline completion, integrated chat, agentic
• The underlying model and the tool are two separate decisions
• Integration quality (context, orchestration) matters as much as the model
• This course focuses on Claude Code and the agentic paradigm
0.8 — Inline completion vs Agent
Inline completion and agentic mode are two fundamentally different paradigms. This distinction determines the kind of tasks you can delegate, the level of autonomy granted to the AI, and the productivity multiplier you can expect.
Inline completion
You type code, the AI suggests the continuation in real time. You accept (Tab) or reject it. Control is total: you stay the pilot. The gain is linear, between 10 and 30% less code typed. The AI works at the line or block level, never at the project level.
Agentic mode
You describe a high-level intent:
Implement a repository for retail offers,
following the same pattern as the existing
repository for partnerships.
The agent explores your codebase, creates the necessary files, writes the code, runs the tests, fixes errors, and iterates. You are no longer the pilot, you are the sponsor. 2-hour tasks get done in 10 minutes.
The shift in posture
| | Inline completion | Agentic |
|---|---|---|
| Role | Assisted executor | Sponsor |
| Key skill | Speed + evaluating suggestions | Formulating intents + verification |
| Level | Syntactic (text completion) | Semantic (intent understanding) |
| Risk | Low (one line at a time) | High if misconfigured (15 files modified) |
The more autonomy the AI has, the more costly errors can be. An agent that modifies 15 files based on a misunderstanding of your architecture creates a disaster in 30 seconds. That's why configuration (guardrails, conventions, context files) is crucial in agentic mode.
Key takeaways
• Inline completion: 10-30% gain, total control, line/block level
• Agentic: order-of-magnitude gain, but requires a well-configured environment
• Moving from one to the other is a paradigm shift, not an incremental evolution
• More autonomy = more risk without proper configuration
0.9 — The concept of Harness Engineering
The term comes from the harness used on a draft horse. An LLM in agentic mode is a powerful draft horse. Without a harness, that power is uncontrollable. With a well-fitted harness, it becomes a precise, directed pulling force. Harness Engineering is the art of designing and adjusting that harness.
The concept is described in detail by Martin Fowler.
Configuration > individual prompts
Harness Engineering rests on a conviction: time spent configuring the AI's work environment has a higher return on investment than time spent polishing prompts. A prompt is a one-shot use. A configuration is a multiplier, it improves every future interaction. Writing a complete CLAUDE.md file takes 30 minutes, but it improves thousands of interactions over the lifetime of a project.
The components of the harness
The harness is made of 4 elements:
- Configuration files (CLAUDE.md, .cursorrules): persistent context, architecture, conventions, patterns
- Hooks and guardrails: linting, tests, compilation after every agent modification
- Templates and examples: reference files to reproduce the right pattern
- Scope rules: what the agent can and cannot modify
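To make the four components tangible, here is a minimal CLAUDE.md sketch for a hypothetical TypeScript project. Every path, command, and convention below is illustrative; adapt them to your own codebase:

```markdown
# CLAUDE.md — example harness for a hypothetical TypeScript project

## Architecture
- Monorepo: `apps/` for services, `packages/` for shared libraries

## Conventions
- Strict TypeScript, no `any`, no default exports
- New repositories follow the pattern in `packages/db/src/user-repository.ts`

## Guardrails
- After every change, run: `npm run lint && npm test`

## Scope
- Never modify `infra/` or generated files under `**/dist/`
```

Note how the file covers persistent context (architecture, conventions), a template reference, a guardrail command, and scope rules in a few lines each.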
Prompt Engineering vs Harness Engineering
Prompt Engineering optimizes individual interactions, it's tactical. Harness Engineering optimizes the work environment, it's strategic. A good Prompt Engineer gets good answers. A good Harness Engineer gets good answers by default, without extra effort on every interaction.
Key takeaways
• The harness is the point: agentic AI needs to be guided, not just prompted
• Configuration compounds (30 min of config > hours of repetitive prompts)
• The 4 components: context files, hooks, templates, scope rules
• Prompt Engineering = tactical. Harness Engineering = strategic
0.10 — The cost of AI
AI for developers is not free. If you do not understand how pricing works, you risk underusing the tool out of fear of cost, or receiving an unpleasant bill at the end of the month. This sub-module demystifies the economic model.
Pricing per token
Pricing relies on two metrics: input tokens and output tokens. Output tokens cost significantly more.
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4 | $3/M | $15/M |
| Claude Opus 4 | $15/M | $75/M |
| Claude Haiku | $0.25/M | $1.25/M |
The output/input ratio is roughly 5x. Good context (input) that enables a concise response (output) is more economical than thin context that forces long explanations.
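The arithmetic is simple enough to sketch directly from the table above. The prices are the ones listed; the 10K-in / 2K-out exchange is an illustrative shape for a typical call:

```python
# Per-million-token prices from the table above (USD).
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus":   {"input": 15.00, "output": 75.00},
    "haiku":  {"input": 0.25, "output": 1.25},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars: tokens / 1M times the per-million price, input + output."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical exchange: a 10K-token context producing a 2K-token answer.
print(round(call_cost("sonnet", 10_000, 2_000), 4))  # 0.06
```

Notice that the 2K output tokens cost as much as the 10K input tokens: this is the 5x ratio in action, and why concise outputs matter.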
Real cost in practice
With Claude Code, every interaction consumes tokens: files read, codebase searches, command results analyzed. An intensive agentic development session on a large project consumes between 5 and 20 dollars per day. In perspective: if that session saves you 4 hours at €80/h, the ROI is 20 to 1.
Optimization strategies
1. Right model for the task: Sonnet for daily work, Opus for complex work
2. Short conversations: a new conversation is more efficient than a long one whose history costs more and more
3. Good persistent context: a well-written CLAUDE.md avoids re-explaining the context in every session
4. Subscription for heavy usage: Claude Max ($100 or $200/month) can be cheaper than pay-as-you-go
Key takeaways
• Output tokens cost ~5x more than input tokens
• An intensive Claude Code session: $5-20/day
• ROI is obvious for a developer whose hourly cost exceeds €50
• Optimize through model choice, conversation length, and persistent context
0.11 — Security and privacy
When you use an LLM, your code leaves your machine. That's normal operation, not a bug. Understanding exactly what is sent, where, and how it is processed is not paranoia, it's professional responsibility.
What is transmitted
With Claude Code, every interaction sends data to Anthropic's servers via the API: your prompt, the files the agent reads, the results of executed commands, the conversation history. Anthropic contractually commits (API policy) not to use this data to train its models. But the transmission itself is a risk vector.
The three concrete risks
1. Secrets sent accidentally: a .env file with AWS keys, a hardcoded OAuth token, a database password. If the agent reads them, they are transmitted.
2. Personal data: test fixtures with real emails, names, addresses. A potential GDPR violation.
3. Intellectual property: proprietary code, business algorithms, business logic sent to a third-party service.
Best practices
Example .claudeignore file to prevent the agent from reading sensitive files:
.env
.env.*
credentials/
secrets/
**/fixtures/production/
Baseline rules:
- Never put secrets in code, use environment variables + a secrets manager
- Anonymized test fixtures
- Regularly review session logs
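A .claudeignore file blocks whole paths, but leaks also hide inside otherwise legitimate files. Here is a sketch of a pre-flight scan over text before it reaches a prompt. The regexes are illustrative and far from exhaustive; a real setup should rely on a dedicated secret scanner:

```python
import re

# Illustrative patterns only: these catch just the most obvious leaks.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key ID shape
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(?:password|secret|token)\s*=\s*\S+"),
]

def find_secrets(text: str) -> list[str]:
    """Return the matches that should block this text from being sent."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(pattern.findall(text))
    return hits

safe = "DATABASE_URL is read from the environment at startup."
leaky = "password = hunter2  # TODO remove"
print(find_secrets(safe))   # []
print(find_secrets(leaky))  # one match
```

Such a check fits naturally into a pre-send hook: refuse to transmit anything for which `find_secrets` returns a non-empty list.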
At the organizational level: define a clear policy on AI tool usage before problems arise. For sensitive cases, options exist: VPC deployment with Anthropic, filtering proxy, or local models (LLaMA, Mistral).
Key takeaways
• Every interaction sends data to the provider's servers
• Real risks: secrets, personal data (GDPR), intellectual property
• .claudeignore is your first line of defense
• Define a clear team policy before the incident, not after
0.12 — Choosing your model
Not all models are equal, and the most capable is not always the best choice. At Anthropic, the Claude lineup comes in three tiers: Opus (the most capable), Sonnet (the balanced one), and Haiku (the fastest). Picking the right model for the right task is one of the simplest and most impactful optimizations.
The Claude lineup
| Model | Strengths | Input price | Output price | Usage |
|---|---|---|---|---|
| Opus 4 | Complex reasoning, architecture, deep bugs | $15/M | $75/M | ~10-20% of tasks |
| Sonnet 4 | Versatile, good intelligence/cost balance | $3/M | $15/M | ~70-80% of tasks |
| Haiku | Ultra-fast, economical | $0.25/M | $1.25/M | Simple tasks, pipelines |
When to use what
- Opus: subtle bug involving 10 modules, architectural refactor, design decision with complex trade-offs
- Sonnet: standard code generation, writing tests, medium-sized refactoring, code review
- Haiku: formatting, boilerplate, syntactic transformations, high-volume automated pipelines
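The heuristic above amounts to a routing table. Here is a sketch; the task categories and their mapping follow this section's recommendations, while the function itself is purely illustrative:

```python
# Task categories mapped to the tiers recommended above.
MODEL_FOR_TASK = {
    "architecture": "opus",    # complex trade-offs, deep bugs
    "deep_bug":     "opus",
    "code_gen":     "sonnet",  # standard daily work
    "tests":        "sonnet",
    "review":       "sonnet",
    "formatting":   "haiku",   # high-volume, simple transformations
    "boilerplate":  "haiku",
}

def pick_model(task: str) -> str:
    """Route a task to a model tier, defaulting to the versatile middle tier."""
    return MODEL_FOR_TASK.get(task, "sonnet")

print(pick_model("deep_bug"))      # opus
print(pick_model("boilerplate"))   # haiku
print(pick_model("unknown_task"))  # sonnet (safe default)
```

Defaulting to Sonnet mirrors the 70-80% figure in the table: when in doubt, the balanced tier is the cheapest mistake.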
Claude Code supports model selection mid-session:
/model sonnet      # Recommended default model
/model opus        # For complex tasks
/model opus[1m]    # Opus with the window extended to 1M tokens
/model sonnet[1m]  # Sonnet with the window extended to 1M tokens
Beyond Anthropic: GPT-4o (OpenAI), Gemini (Google, window up to 1M tokens), LLaMA and Mistral (open source, locally deployable). The choice depends on your constraints around cost, privacy, performance, and integration.
Effort level
Beyond model choice, Claude Code lets you control the effort level the model invests in each response via the /effort command. Four levels are available:
| Level | Behavior | Usage |
|---|---|---|
| low | Short, direct answers, less reasoning | Simple questions, confirmations |
| medium | Balanced approach, standard implementation | Everyday development |
| high | Deep analysis, thorough exploration | Architecture, complex bugs |
| max | Maximum effort, deepest reasoning | Critical problems, structuring decisions |
/effort high    # Deep analysis
/effort medium  # Balanced (default)
/effort low     # Fast, economical answers
Important point: Anthropic has lowered the default effort level from high to medium. In practice, medium is enough for most development tasks. Reserving high for moments when you need deep reasoning saves tokens and yields faster answers day-to-day.
Key takeaways
• Sonnet by default (70-80% of tasks), Opus for complex work, Haiku for volume
• Opus costs 5x more than Sonnet, use it when complexity justifies it
• Claude Code lets you switch model mid-session with /model
• /effort controls reasoning depth: low, medium (default), high, max
• The "best model" depends on the task, not an absolute benchmark