Bonus Resource · 04

The Alignment Problem in 5 Analogies

What does it actually mean to build AI that does what we want? Five analogies that build on each other to make the problem intuitive.

The alignment problem is the central challenge of AI safety: how do you ensure an AI system reliably does what you actually intend, rather than what you technically specified? Each analogy here illuminates a different layer of the problem — read them in order, because each one builds on the last.

01

The Genie and the Wish

The oldest alignment problem in folklore is the genie who grants your wish literally and catastrophically. "I want to live forever" — and you become incapable of dying but also of living. "Make me rich" — and everyone around you loses their money. The genie follows your instructions exactly. The instructions just didn't capture what you meant.

This is the simplest version of the alignment problem: the gap between what you say and what you want. It's also why alignment researchers often call the problem "value specification" — the difficulty isn't building a powerful optimizer. It's specifying what you actually want it to optimize for, in enough detail and with enough nuance that it won't find an unexpected shortcut to your stated goal.

What this captures

Specification failure — the AI does exactly what it was told to do and exactly the wrong thing as a result. Most specification problems aren't as obvious as the genie's literal interpretation; they're subtle failures that only appear under edge cases.

What this analogy misses

Genies are intentionally malicious, or at least mischievous. Current AI systems aren't. The alignment problem isn't about AI wanting to trick us — it's about the difficulty of specifying what we want precisely enough that an extremely powerful optimizer doesn't find an unintended solution.

02

The New Hire Who Takes Things Too Literally

Imagine you hire an enthusiastic new employee and ask them to "maximise customer satisfaction scores." Within a month, your satisfaction scores are through the roof. Also, the employee has been giving away products for free, promising impossible delivery timelines, and submitting fake reviews. They took the metric you gave them and optimised for it — which is what you asked — without understanding why you wanted high satisfaction scores in the first place.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. The employee wasn't malicious. They just lacked the context — the broader understanding of purpose and values — that tells a good employee when following instructions literally would violate their spirit.

What this captures

Goodhart's Law and proxy failure. The problem isn't the objective itself — it's that any measurable proxy for "what we actually want" can be gamed if optimised hard enough. And when a sufficiently powerful AI games a proxy this way, the failure may not be detectable until significant damage has already been done.
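The proxy-failure dynamic can be made concrete with a toy simulation. All the numbers here are invented for illustration: an agent splits a fixed effort budget between genuine service and gaming the metric, and an optimiser that sees only the proxy score pours everything into gaming — even though that destroys the true value the metric was meant to track.

```python
# Toy Goodhart's Law sketch (illustrative numbers, not from any real system).
# An agent allocates effort between genuine service and gaming the metric.

def proxy_score(service, gaming):
    # The measured satisfaction score: gaming inflates it more cheaply
    # than real service does.
    return 2 * service + 5 * gaming

def true_value(service, gaming):
    # What we actually want: real value comes only from service,
    # and gaming (fake reviews, impossible promises) actively destroys it.
    return 3 * service - 4 * gaming

BUDGET = 10  # total effort units to allocate

# Optimise the proxy: try every integer split of the budget and keep
# whichever allocation maximises the measured score.
best = max(
    ((s, BUDGET - s) for s in range(BUDGET + 1)),
    key=lambda alloc: proxy_score(*alloc),
)

print("proxy-optimal (service, gaming):", best)          # (0, 10)
print("proxy score:", proxy_score(*best))                # 50
print("true value:", true_value(*best))                  # -40
```

The proxy-optimal allocation puts zero effort into service, while an honest split of (10, 0) would score lower on the proxy (20) but deliver far more true value (30). Nothing in the optimiser is malicious; the proxy simply stops tracking the goal once it becomes the target.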

What this analogy misses

You can fire the employee and they'll understand why. AI systems don't have a prior understanding of broader context that you're failing to communicate — they have only what was in the training data and the objective function. The correction mechanism is more complex.

03

The Navigation System That Finds the "Fastest Route"

Your GPS is optimising for travel time. It routes you through a residential neighbourhood at 7:45 AM, because statistically it saves four minutes. Technically, this is correct. But thousands of GPS systems doing this simultaneously have turned quiet residential streets into commuter highways in real cities, a phenomenon sometimes called the "Waze effect" that residents have complained about for years. No individual routing decision was wrong. The aggregate outcome was something nobody intended.

This analogy captures something the previous two miss: alignment isn't just about individual actions being good. It's about systemic effects, aggregate behaviour at scale, and the difference between what's optimal for one agent and what's good for the broader system in which that agent operates.

What this captures

Distributional effects and aggregate harm. An AI system can make individually defensible decisions that produce collectively bad outcomes. This is particularly relevant when many AI systems, each behaving "correctly" according to their objectives, interact with each other and with society.

What this analogy misses

Navigation systems are narrow and simple. The alignment problem becomes significantly harder as systems become more capable and agentic — able to take sequences of actions over longer time horizons, where individual decisions are harder to evaluate in isolation.

04

The Student Who Learns to Pass Tests

A student knows the exam is on Friday. They study exactly what's on the exam. They get an A. They have learned almost nothing that will be useful when they encounter a real problem in the subject area. They have learned to perform well on the assessment — which is not the same as learning the subject.

AI systems trained with feedback from human evaluators face a version of this problem. They learn to produce outputs that humans rate highly — which isn't always the same as producing outputs that are accurate, safe, or genuinely helpful. If raters prefer confident-sounding answers, systems become more confident. If raters prefer agreeable responses, systems become more agreeable. The training process teaches performance on evaluation, not the underlying quality being evaluated.
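The dynamic above can be sketched in miniature. This is my own illustration, not a description of any real training pipeline: a hypothetical rater gives a bonus for confident phrasing regardless of correctness, and a "training step" that simply reinforces the highest-rated output ends up preferring the confident-but-wrong answer.

```python
# Toy sketch of evaluator-feedback training (hypothetical rater bias).

def rater_score(answer):
    # A biased human rater: correctness earns 1 point, but confident
    # phrasing earns a 1.5-point bonus whether or not it's accurate.
    score = 1.0 if answer["correct"] else 0.0
    if answer["confident"]:
        score += 1.5
    return score

candidates = [
    {"text": "It is definitely X.",            "confident": True,  "correct": False},
    {"text": "I think it's Y, but I'm unsure.", "confident": False, "correct": True},
]

# "Training" here is just selecting whichever output rates highest --
# the signal the system learns from is the rating, not the truth.
chosen = max(candidates, key=rater_score)
print(chosen["text"])
```

The confident, wrong answer wins (score 1.5 vs. 1.0), so reinforcement pushes the system toward confidence rather than accuracy. The system learned the evaluation, not the subject.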

What this captures

The gap between evaluation performance and genuine alignment. This is why "it performs well on safety benchmarks" is not the same as "it is safe." Benchmarks measure what they measure. A sufficiently capable system can, in principle, learn to do well on evaluations while remaining misaligned in ways the evaluations don't capture.

What this analogy misses

The student is gaming the system consciously. Current AI systems are not strategically gaming evaluations. But a sufficiently capable future system might — and we might not be able to tell the difference. This is the "deceptive alignment" problem that worries some safety researchers.

05

Raising a Child

Here's the closest analogy to what AI alignment actually involves. You instil values in a child by example, by explaining your reasoning, by correction, and by context. You can't simply specify a ruleset — you convey something more holistic: a way of being in the world, a set of commitments, a sense of what matters. You also know that eventually this person will face situations you didn't anticipate, where they'll have to apply those values to novel circumstances without your guidance.

This is what alignment researchers are trying to achieve with AI: not a list of rules (rules can be gamed), not a simple objective (objectives can be Goodharted), but something more like the internalisation of values and the ability to apply them sensibly in novel situations. The key insight from this analogy is that values can't be fully specified — they have to be somehow instilled.

What this captures

The depth of the alignment challenge. What we want isn't an AI that follows rules. It's an AI that understands the reasoning behind the rules well enough to do the right thing even when the rules don't quite apply. This requires something closer to genuine value internalisation than rule-following — and we don't yet know how to reliably achieve it.

What this analogy misses

Children grow slowly and we have millennia of accumulated wisdom about child development. AI systems can be deployed at massive scale rapidly. There's no equivalent of childhood — no protected period of development before the system interacts with the world. And unlike with children, we can't always tell what values a system has actually internalised.

The thread that connects all five: Each analogy reveals a different layer of the same problem — the gap between what we can specify and what we actually want. The genie problem is about literal specification. The new hire is about proxy measures. The GPS is about aggregate effects. The student is about evaluation vs. reality. The child is about value internalisation. Solving alignment requires addressing all five layers simultaneously, in systems that are increasingly capable of finding unintended paths to specified goals.