Going down a random rabbit hole: From XML Tags to $100M Weight Updates

I started by asking a question: Why does Anthropic insist on XML tags? It sounded like a minor formatting preference. But as I dug into the engineering, it became clear that XML isn't just "nice to use"—it is the statistical anchor Claude’s 175B+ parameters were built to recognize.

1. The XML "Attention Fence"

Claude treats XML as its native interface because of how its Self-Attention heads were trained.

The Logic: During pre-training, Anthropic wrapped "untrusted" data in tags. This taught the model to mathematically lower the "weight" of tokens inside <user_query> when calculating the next token for a system instruction.
The Result: XML tags act as a structural fence, preventing Attention Contamination where the model might mistake your background data for a new command.

2. Phase 1: Supervised Fine-Tuning (SFT) — Behavioral Cloning

This is where the model learns the Constitution—a 60-point rulebook. It isn't "thinking" yet; it is mimicking.

The Goal: The model generates a response, critiques it against the Constitution, and revises it. We then train the model to clone that perfect revision.
The Math (Cross-Entropy Loss):
For every word, the model generates a probability vector P the size of its unique vocabulary (~100k tokens).
We compare P to the One-Hot Ground Truth Y—a vector of all zeros except for a 1 at the correct token's index.
The Update: Using Loss = -Sum( Y * log(P) ), we measure the gap. If the model predicted "Hello" (40%) but the truth was <thinking> (100%), Backpropagation adjusts the weights across the entire network to make <thinking> the statistically "correct" choice.

3. Phase 2: RLAIF & PPO — The Optimization Stage

Once the model knows how to speak, Phase 2 teaches it judgment using Reinforcement Learning from AI Feedback (RLAIF).

The PPO Objective Function:

Objective = Reward - Beta * KL_Divergence(New_Model || Old_Model)

The Reward: An AI "Judge" model scores billions of responses. If Claude follows the Constitution and uses XML correctly, it gets a "Reward" (a positive scalar). The weights involved in that "winning" logic are reinforced.
The KL Penalty (The Anchor): This is critical. If the model tries to "game" the reward by becoming a gibberish-machine that only says things the Judge likes, the KL Divergence penalty spikes. It forces the model to stay "tethered" to the stable language base from Phase 1.
PPO Clipping: Proximal Policy Optimization ensures the weights only move by a tiny bit (usually 20%) per update, preventing "model collapse" from outlier feedback.

4. Scaling: The $100M+ Infrastructure

The Bottleneck: Traditional RLHF is limited by human reading speed (~200 wpm).
The Scaling: RLAIF runs at the speed of the H100 GPU cluster. Anthropic can simulate 100 years of "human judgment" in a weekend. The massive cost isn't human labor—it's the electricity and compute required to run these recursive self-improvement loops.

The Takeaway: Using XML tags isn't just about being "neat." It is about aligning your prompt with the exact statistical patterns Claude’s weights were optimized to reward through millions of dollars of compute.

Going down a random rabbit hole: From XML Tags to $100M Weight Updates

Billy

1. The XML "Attention Fence"

2. Phase 1: Supervised Fine-Tuning (SFT) — Behavioral Cloning

3. Phase 2: RLAIF & PPO — The Optimization Stage

4. Scaling: The $100M+ Infrastructure

Topics