Harness Engineering 6 months ahead, in a few hours.

Module 00 — AI basics for developers

In this module

You will understand how an LLM predicts text without understanding it, why tokens cost money and how to optimize them, what separates a good prompt from good context, and why configuring the AI (the harness) pays off more than polishing your prompts.

0.1 — What is an LLM?

A Large Language Model (LLM) is a neural network trained on massive amounts of text to perform a fundamental task: predict the next token. When you type "The capital of France is", the model computes a probability distribution and determines that "Paris" is the most likely token. There is no internal database, no hidden search engine, just large-scale statistical prediction.

The autocomplete analogy

Your IDE's autocomplete analyzes the local context (imports, types, variables in scope) to suggest the next token. An LLM does the same thing, but its "scope" covers the entire knowledge absorbed during training: documentation, open source code, books, articles. Same principle, radically different scale.

Training in 3 phases

Training breaks down into distinct stages:

1. Pre-training: the model ingests terabytes of text (code, Wikipedia, books, forums) to learn the statistical patterns of language. This phase costs tens of millions of dollars in GPU compute.

2. Fine-tuning: the model is refined on more specific, higher-quality data.

3. RLHF (Reinforcement Learning from Human Feedback): humans evaluate the model's responses, and it learns to produce useful and relevant results. This phase turns a raw text predictor into a conversational assistant.

The core technical mechanism is attention (Transformer architecture, Google, 2017). It lets the model "look at" every part of the input text simultaneously. If you write a 200-line function and the return type is defined at line 3, attention lets the model connect that information at line 200.
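To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention in pure Python. The tiny 2-dimensional vectors stand in for real embeddings; all the numbers are illustrative, not taken from any actual model.

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query position.
    Each weight says how strongly the query 'looks at' each position."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output mixes all value vectors, weighted by attention
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Toy example: 3 positions in the context, 2-dimensional vectors
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, weights = attention([1.0, 0.0], keys, values)
```

The query attends most to the keys it aligns with, which is exactly how a return type defined at line 3 can influence the prediction at line 200: every position is scored against every other, in parallel.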

Key takeaways

• An LLM predicts the next token based on all previous tokens

• Training happens in 3 phases: pre-training, fine-tuning, RLHF

• The model does not "understand", it produces statistically likely text

• It's autocomplete at the scale of the internet, not a reasoning engine


0.2 — Tokens: the basic unit

A token is not a word. It's a fragment of text, sometimes a whole word (function), sometimes a piece of a word (des + er + ial + ization), sometimes a single character (a brace, a newline). The splitting process is called tokenization, performed by an algorithm (tokenizer) specific to each model family. Confusing tokens and words is a mistake that costs money, literally.

Why tokens matter for a developer

Three direct reasons:

  • Cost: you pay by usage in tokens, not words or characters. Every token sent (input) and generated (output) is billed.
  • Context window: if your model has a 200,000-token window, that's the limit of what it can "see" simultaneously, not 200,000 words.
  • Code is token-hungry: braces, parentheses, indentation, long variable names. A 100-line code file easily represents 500 to 800 tokens.

Counting your tokens

Rough rule of thumb: 1 token ≈ 4 characters in English ≈ 3/4 of a word. French consumes 20-30% more tokens than English.

  • function: 1 token
  • désérialisation: 5 tokens
  • 100 lines of code: ~500-800 tokens

OpenAI's tokenizer is available online (platform.openai.com/tokenizer). Getting into the habit of checking the token count of your prompts and context files is the first step toward efficient LLM usage.
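For a quick back-of-the-envelope check, the rule of thumb above can be turned into a helper. This is a rough heuristic only, not a real tokenizer; actual counts are model-specific and will differ, especially on code.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.
    Real tokenizers differ per model family; use this only for ballpark budgeting."""
    return max(1, round(len(text) / 4))

# A 100-character prompt lands around 25 tokens
print(estimate_tokens("x" * 100))  # → 25
```

For precise counts, use the model provider's own tokenizer rather than this heuristic.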

Key takeaways

• A token is a fragment of text, not a word

• The tokenizer is specific to each model (Claude, GPT-4, etc.)

• Code consumes more tokens than natural text

• French consumes 20-30% more tokens than English

• Everything is billed in tokens: input + output


0.3 — The context window

The context window is the maximum number of tokens an LLM can process in a single interaction. It's the model's working memory. Everything the model "knows" about your conversation, your code, and your instructions must fit inside this window. If you exceed the limit, the oldest information is truncated or ignored.

What fills the window

The window does not contain only your prompt. It includes everything:

  • System prompt: the model's base instructions
  • History: all previous questions and answers
  • Injected context: files, documentation, search results
  • Response in progress: the tokens generated by the model

When you use Claude Code and it reads 15 files from your project, those 15 files take up space. If your conversation has lasted 30 exchanges, all that history is there too. For a project with its configuration files, modules, and tests, a few dozen files are enough to saturate the window.
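The arithmetic is worth sketching. Assuming a 200K window and purely illustrative sizes for each component (these are not real Claude Code internals), a hypothetical budget check looks like this:

```python
def context_budget(window=200_000, system=2_000, history=30_000,
                   files_tokens=0, reserve_output=8_000):
    """Tokens left for new input after the fixed costs.
    All the default sizes are illustrative assumptions, not measured values."""
    used = system + history + files_tokens + reserve_output
    return window - used

# 15 files at ~600 tokens each already eat 9,000 tokens of the window
remaining = context_budget(files_tokens=15 * 600)
print(remaining)  # → 151000
```

The point of the exercise: the prompt you type is usually the smallest item in the window; history and injected files dominate the budget.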

The "lost in the middle" phenomenon

LLMs pay more attention to information at the beginning and end of the context, and tend to "forget" what sits in the middle. If you paste 50 files and the critical information is in the 25th, the model is likely to miss it. The selection and ordering of context are just as important as the content of the prompt.

Common window sizes (2026):

  • Claude Opus 4 / Sonnet 4: 200K tokens (standard), 1M tokens (extended)
  • GPT-4 Turbo: 128K tokens
  • Gemini: up to 1M tokens

Key takeaways

• The context window = the LLM's RAM, measured in tokens

• It includes everything: system prompt, history, injected files, response in progress

• More context does not mean a better answer, quality beats quantity

• Long conversations degrade quality (history saturates the window)


0.4 — Prompt vs Context

Most developers disappointed by AI make the same mistake: they focus on the prompt (the question asked) and neglect the context (all the information that accompanies that question). In practice, context is responsible for the majority of answer quality. A simple prompt in a rich context beats a sophisticated prompt in a poor context.

The context layers

An LLM's context works in layers, from the deepest to the most superficial:

1. System prompt: base instructions that define the model's behavior

2. Persistent context: configuration files like CLAUDE.md, project rules, conventions

3. Session context: files read, search results, conversation history

4. User prompt: your question or instruction

Each layer influences the response, but the deeper layers have more impact than the surface prompt. A well-written CLAUDE.md file improves all your future interactions, not just the next one.

The consultant analogy

The prompt is the question you ask a consultant. The context is the brief sent before the meeting: project documentation, existing code, technical constraints, history of decisions. Without a brief, you get a generic answer. With a complete brief, the same question produces a specific and actionable answer.

Key takeaways

• The prompt = your question. The context = everything else

• Context has more impact on quality than the prompt itself

• Investing in persistent context (CLAUDE.md, conventions) compounds

• The real skill is not prompt engineering, it's context engineering


0.5 — Temperature, top-p and generation settings

When an LLM generates the next token, it computes a probability distribution over its entire vocabulary. Generation settings control how the model picks among these candidates. Most developers use default values without understanding what they are leaving on the table.

Temperature

Temperature controls the level of randomness in token selection:

  • 0: deterministic, always the most likely token. For code, tests, refactoring
  • 0.3-0.5: slight variability, good balance. For exploratory refactoring
  • 0.7-1.0: creative, more diversity. For brainstorming, writing
  • > 1.0: erratic, picks unlikely tokens. Rarely useful

Other settings

  • top-p (nucleus sampling): truncates the options. A top-p of 0.9 considers only the smallest set of tokens whose cumulative probability reaches 90%. Removes unlikely tokens while preserving diversity.
  • max_tokens: maximum length of the response. Too low = response cut off in the middle of a code block.
  • top-k: limits the choice to the k most likely tokens.
  • frequency_penalty: penalizes tokens already used, to avoid repetition.
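A minimal sketch of how temperature and top-p reshape the distribution, using toy scores rather than a real model's logits:

```python
import math

def apply_temperature(logits, temperature):
    """Temperature rescales logits before the softmax.
    At 0 the distribution collapses onto the single most likely token."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize. Everything outside the nucleus is never sampled."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Three candidate tokens: higher temperature flattens their probabilities
probs = apply_temperature([2.0, 1.0, 0.1], temperature=0.7)
```

Toy numbers aside, this is the actual shape of the computation: temperature acts before the softmax, top-p acts on the resulting probabilities.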

In practice, with Claude Code in agentic mode, these settings are managed automatically. But when you use the API directly or configure a tool like Cursor, understanding these settings makes the difference.

Key takeaways

• Temperature is the creativity/reliability dial, for code, keep it low

• top-p and temperature are complementary, you rarely tune both at once

• max_tokens too low = truncated response, too high = unnecessary cost

• Claude Code optimizes these settings automatically in agentic mode


0.6 — The limits of LLMs

Understanding the limits of a tool is what separates the developer who uses it effectively from the one who is constantly disappointed. An LLM is not an oracle. It's a tool with predictable blind spots, and knowing them lets you work around them.

Hallucinations

An LLM can generate false information with absolute confidence: invent a nonexistent API, cite fictional parameters of a real library, produce code that uses imaginary functions. The model does not "verify" anything, it generates statistically likely text. The risk increases on topics sparsely represented in training data: recent libraries, internal APIs, proprietary code.

Limited reasoning

LLMs excel at pattern matching but struggle with complex multi-step reasoning. Tracing the execution of a recursive algorithm across 15 levels of depth exceeds their statistical prediction capabilities. It's not formal reasoning, it's pattern recognition.

Other structural limits

  • No persistent memory: every conversation starts from scratch (except for persistence mechanisms like CLAUDE.md)
  • Cutoff date: the model doesn't know about events after its training
  • Data bias: over-representation of certain languages (Python, JavaScript), frameworks (React, Spring), and architectural approaches
  • No execution: it cannot test, compile, or verify its own output. That's why agentic tools like Claude Code are powerful: they fill this gap

Key takeaways

• Hallucinations are inherent, always verify APIs and function signatures

• Multi-step reasoning is fragile, break down complex problems

• No memory between sessions without explicit configuration

• Agentic tools compensate for the lack of execution (testing, compiling, iterating)


0.7 — The AI tooling ecosystem for devs

The market for AI developer tools evolves every week. GitHub Copilot, Cursor, Claude Code, Windsurf, Codeium, Amazon Q Developer, the profusion is real. Understanding the fundamental differences between these tools helps you pick the one that fits your workflow.

Three integration types

  • IDE (Copilot, Cursor, Windsurf): integrates into your editor
  • CLI (Claude Code, Gemini CLI, Codex CLI): runs from the terminal
  • Web (ChatGPT, Claude.ai): chat interface

Three interaction paradigms

1. Inline completion (Copilot, Tabnine, Codeium): predicts the continuation of your code as you type. Advanced autocomplete, at line or block level.

2. Integrated chat (Copilot Chat, Cursor Chat): questions and modification requests in a chat window next to the code.

3. Agentic (Claude Code, Cursor Composer, Windsurf Cascade): autonomous action. The agent reads files, runs commands, modifies code, runs tests, and iterates.

Underlying model vs tool

A commonly confused point: the tool and the model are two separate choices. Cursor can use GPT-4 or Claude. Copilot uses OpenAI models. Claude Code exclusively uses Claude models. The quality of the integration, what context the tool sends, how it handles errors, how it orchestrates actions, often matters more than the raw model.

Key takeaways

• Three paradigms: inline completion, integrated chat, agentic

• The underlying model and the tool are two separate decisions

• Integration quality (context, orchestration) matters as much as the model

• This course focuses on Claude Code and the agentic paradigm


0.8 — Inline completion vs Agent

Inline completion and agentic mode are two fundamentally different paradigms. This distinction determines the kind of tasks you can delegate, the level of autonomy granted to the AI, and the productivity multiplier you can expect.

Inline completion

You type code, the AI suggests the continuation in real time. You accept (Tab) or reject it. Control is total: you stay the pilot. The gain is linear, between 10 and 30% less code typed. The AI works at the line or block level, never at the project level.

Agentic mode

You describe a high-level intent:

Implement a repository for retail offers,
following the same pattern as the existing
repository for partnerships.

The agent explores your codebase, creates the necessary files, writes the code, runs the tests, fixes errors, and iterates. You are no longer the pilot, you are the sponsor. 2-hour tasks get done in 10 minutes.

The shift in posture

From inline completion to agentic:

  • Role: assisted executor → sponsor
  • Key skill: speed and evaluating suggestions → formulating intents and verification
  • Level: syntactic (text completion) → semantic (intent understanding)
  • Risk: low (one line at a time) → high if misconfigured (15 files modified)

The more autonomy the AI has, the more costly errors can be. An agent that modifies 15 files based on a misunderstanding of your architecture creates a disaster in 30 seconds. That's why configuration (guardrails, conventions, context files) is crucial in agentic mode.

Key takeaways

• Inline completion: 10-30% gain, total control, line/block level

• Agentic: order-of-magnitude gain, but requires a well-configured environment

• Moving from one to the other is a paradigm shift, not an incremental evolution

• More autonomy = more risk without proper configuration


0.9 — The concept of Harness Engineering

The term comes from the harness used to drive a draft horse. An LLM in agentic mode is a powerful draft horse. Without a harness, that power is uncontrollable. With a well-fitted harness, it becomes a precise, directed pulling force. Harness Engineering is the art of designing and adjusting that harness.

The concept is described in detail by Martin Fowler.

Configuration > individual prompts

Harness Engineering rests on a conviction: time spent configuring the AI's work environment has a higher return on investment than time spent polishing prompts. A prompt is a one-shot use. A configuration is a multiplier, it improves every future interaction. Writing a complete CLAUDE.md file takes 30 minutes, but it improves thousands of interactions over the lifetime of a project.

The components of the harness

The harness is made of 4 elements:

  • Configuration files (CLAUDE.md, .cursorrules): persistent context, architecture, conventions, patterns
  • Hooks and guardrails: linting, tests, compilation after every agent modification
  • Templates and examples: reference files to reproduce the right pattern
  • Scope rules: what the agent can and cannot modify

Prompt Engineering vs Harness Engineering

Prompt Engineering optimizes individual interactions, it's tactical. Harness Engineering optimizes the work environment, it's strategic. A good Prompt Engineer gets good answers. A good Harness Engineer gets good answers by default, without extra effort on every interaction.

Key takeaways

• The harness metaphor: agentic AI needs to be guided, not just prompted

• Configuration compounds (30 min of config > hours of repetitive prompts)

• The 4 components: context files, hooks, templates, scope rules

• Prompt Engineering = tactical. Harness Engineering = strategic


0.10 — The cost of AI

AI for developers is not free. If you do not understand how pricing works, you risk underusing the tool out of fear of cost, or receiving an unpleasant bill at the end of the month. This sub-module demystifies the economic model.

Pricing per token

Pricing relies on two metrics: input tokens and output tokens. Output tokens cost significantly more.

  • Claude Sonnet 4: $3/M input, $15/M output
  • Claude Opus 4: $15/M input, $75/M output
  • Claude Haiku: $0.25/M input, $1.25/M output

The output/input ratio is roughly 5x. Good context (input) that enables a concise response (output) is more economical than thin context that forces long explanations.
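The billing arithmetic is simple enough to sketch. This hypothetical helper hardcodes the per-million rates quoted in this section; real invoices depend on the provider's current price list.

```python
# Prices per million tokens (input, output), as quoted in this section
PRICES = {
    "sonnet-4": (3.00, 15.00),
    "opus-4": (15.00, 75.00),
    "haiku": (0.25, 1.25),
}

def session_cost(model, input_tokens, output_tokens):
    """Cost in dollars: each side is billed at its own per-million rate."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000 * in_price
            + output_tokens / 1_000_000 * out_price)

# 500K input tokens + 50K output tokens on Sonnet 4
print(round(session_cost("sonnet-4", 500_000, 50_000), 2))  # → 2.25
```

Note how the input side dominates in a typical agentic session: the model reads far more than it writes, which is why trimming context pays off even though output tokens are the expensive ones.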

Real cost in practice

With Claude Code, every interaction consumes tokens: files read, codebase searches, command results analyzed. An intensive agentic development session on a large project consumes between 5 and 20 dollars per day. In perspective: if that session saves you 4 hours at €80/h, the ROI is 20 to 1.

Optimization strategies

1. Right model for the task: Sonnet for daily work, Opus for complex work

2. Short conversations: a new conversation is more efficient than a long one whose history costs more and more

3. Good persistent context: a well-written CLAUDE.md avoids re-explaining the context in every session

4. Subscription for heavy usage: Claude Max ($100 or $200/month) can be cheaper than pay-as-you-go

Key takeaways

• Output tokens cost ~5x more than input tokens

• An intensive Claude Code session: $5-20/day

• ROI is obvious for a developer whose hourly cost exceeds €50

• Optimize through model choice, conversation length, and persistent context


0.11 — Security and privacy

When you use an LLM, your code leaves your machine. That's normal operation, not a bug. Understanding exactly what is sent, where, and how it is processed is not paranoia, it's professional responsibility.

What is transmitted

With Claude Code, every interaction sends data to Anthropic's servers via the API: your prompt, the files the agent reads, the results of executed commands, the conversation history. Anthropic contractually commits (API policy) not to use this data to train its models. But the transmission itself is a risk vector.

The three concrete risks

1. Secrets sent accidentally: a .env file with AWS keys, a hardcoded OAuth token, a database password. If the agent reads them, they are transmitted.

2. Personal data: test fixtures with real emails, names, addresses. A potential GDPR violation.

3. Intellectual property: proprietary code, business algorithms, business logic sent to a third-party service.

Best practices

Example .claudeignore file to prevent the agent from reading sensitive files:

.env
.env.*
credentials/
secrets/
**/fixtures/production/

Baseline rules:

  • Never put secrets in code, use environment variables + a secrets manager
  • Anonymized test fixtures
  • Regularly review session logs

At the organizational level: define a clear policy on AI tool usage before problems arise. For sensitive cases, options exist: VPC deployment with Anthropic, filtering proxy, or local models (LLaMA, Mistral).
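As a complement to .claudeignore, a pre-flight scan can catch obvious secrets before a file is ever read. This is a minimal illustrative sketch with a few hypothetical patterns; a dedicated scanner such as gitleaks or trufflehog covers far more cases and should be preferred in practice.

```python
import re

# Illustrative patterns only; real scanners maintain much larger rule sets
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access key ID shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"), # PEM private key header
    re.compile(r"(?i)(password|secret|token)\s*=\s*['\"][^'\"]{8,}"),
]

def scan_text(text):
    """Return the patterns that match, so the file can be flagged before upload."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]

hits = scan_text('db_password = "hunter2-prod-2024"')
```

Wiring a check like this into a pre-commit hook or an agent guardrail turns the "never put secrets in code" rule into something enforced rather than remembered.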

Key takeaways

• Every interaction sends data to the provider's servers

• Real risks: secrets, personal data (GDPR), intellectual property

• .claudeignore is your first line of defense

• Define a clear team policy before the incident, not after


0.12 — Choosing your model

Not all models are equal, and the most capable is not always the best choice. At Anthropic, the Claude lineup comes in three tiers: Opus (the most capable), Sonnet (the balanced one), and Haiku (the fastest). Picking the right model for the right task is one of the simplest and most impactful optimizations.

The Claude lineup

  • Opus 4: complex reasoning, architecture, deep bugs. $15/M input, $75/M output. ~10-20% of tasks
  • Sonnet 4: versatile, good intelligence/cost balance. $3/M input, $15/M output. ~70-80% of tasks
  • Haiku: ultra-fast, economical. $0.25/M input, $1.25/M output. Simple tasks, pipelines

When to use what

  • Opus: subtle bug involving 10 modules, architectural refactor, design decision with complex trade-offs
  • Sonnet: standard code generation, writing tests, medium-sized refactoring, code review
  • Haiku: formatting, boilerplate, syntactic transformations, high-volume automated pipelines
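The routing guidance above can be sketched as a simple dispatcher. The task categories and the mapping are illustrative assumptions for this course, not a Claude Code feature:

```python
# Illustrative mapping, following the guidance in this section
ROUTING = {
    "architecture": "opus",
    "deep-bug": "opus",
    "code-generation": "sonnet",
    "tests": "sonnet",
    "review": "sonnet",
    "formatting": "haiku",
    "boilerplate": "haiku",
}

def pick_model(task_category: str) -> str:
    """Default to Sonnet: the balanced tier covers ~70-80% of tasks."""
    return ROUTING.get(task_category, "sonnet")
```

The useful habit is the default: route to the mid-tier unless the task clearly justifies the expensive or the cheap extreme.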

Claude Code supports model selection mid-session:

/model sonnet        # Recommended default model
/model opus          # For complex tasks
/model opus[1m]      # Opus with the context window extended to 1M tokens
/model sonnet[1m]    # Sonnet with the context window extended to 1M tokens

Beyond Anthropic: GPT-4o (OpenAI), Gemini (Google, window up to 1M tokens), LLaMA and Mistral (open source, locally deployable). The choice depends on your constraints around cost, privacy, performance, and integration.

Effort level

Beyond model choice, Claude Code lets you control the effort level the model invests in each response via the /effort command. Four levels are available:

  • low: short, direct answers, less reasoning. For simple questions, confirmations
  • medium: balanced approach, standard implementation. For everyday development
  • high: deep analysis, thorough exploration. For architecture, complex bugs
  • max: maximum effort, deepest reasoning. For critical problems, structuring decisions

/effort high    # Deep analysis
/effort medium  # Balanced (default)
/effort low     # Fast, economical answers

Important point: Anthropic has lowered the default effort level from high to medium. In practice, medium is enough for most development tasks. Reserving high for moments when you need deep reasoning saves tokens and yields faster answers day-to-day.

Key takeaways

• Sonnet by default (70-80% of tasks), Opus for complex work, Haiku for volume

• Opus costs 5x more than Sonnet, use it when complexity justifies it

• Claude Code lets you switch model mid-session with /model

• /effort controls reasoning depth: low, medium (default), high, max

• The "best model" depends on the task, not an absolute benchmark

Quiz — Check your learnings


Which Anthropic model is recommended by default for 70-80% of everyday development tasks?