Bonus Resource · 08

Glossary for the Intellectually Curious

~35 terms explained in plain English — the kind of definitions you'd want before a serious conversation about AI.

Not a technical dictionary. Every definition here is written for a reader who is smart, engaged, and doesn't need the concept dumbed down — just translated. Jargon exists because it's useful shorthand. This glossary makes the shorthand accessible.

A
AGI (Artificial General Intelligence) Technical Concept
A hypothetical AI system capable of performing any intellectual task a human can perform, rather than excelling at one narrow domain. No agreed-upon definition exists. Different researchers set the bar differently: some require human-level performance across a broad range of tasks; others require the ability to autonomously improve itself. The term is frequently used without definition, which makes it nearly useless as a precise technical term.
Often used in ways that obscure more than they reveal — ask what specific capabilities the speaker has in mind.
Alignment Safety
The challenge of building AI systems that reliably do what humans actually want, not just what they technically specified. Alignment is hard because human values are complex, context-dependent, and sometimes contradictory — and because powerful optimization processes are very good at finding unintended solutions to specified objectives. The field of AI alignment is dedicated to solving this problem before AI systems are capable enough that misalignment becomes catastrophic.
Attention Mechanism Technical
The core innovation of the Transformer architecture. Rather than processing information sequentially, the attention mechanism allows a model to consider all parts of an input simultaneously and decide which parts are most relevant to each other. When reading "The trophy doesn't fit in the suitcase because it's too large," attention lets the model resolve that "it" refers to the trophy, by comparing the relationships between all words at once.
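To make the "comparing all words at once" idea concrete, here is a minimal numpy sketch of scaled dot-product attention, the arithmetic at the heart of the mechanism. It omits the learned projection matrices that real Transformers apply to produce queries, keys, and values, and the token vectors are random, purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every query attends to every key at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of every token to every other token
    weights = softmax(scores, axis=-1)   # each row is a distribution over the input
    return weights @ V, weights

# Toy example: 4 tokens, each represented as a 3-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 3))
out, w = attention(tokens, tokens, tokens)  # self-attention: Q = K = V
print(w.sum(axis=-1))                        # each row of weights sums to 1
```

The `weights` matrix is the interesting part: row *i* says how much token *i* "pays attention" to every other token when building its updated representation.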
Autonomous Weapons Safety Policy
Weapons systems that can select and engage targets without human decision at the moment of use. The integration of AI into military systems raises the question of meaningful human control: if an AI system makes a lethal decision in milliseconds, what does "human oversight" actually mean? International discussions about autonomous weapons ("killer robots") have been ongoing for years without binding agreement.
B
Benchmark Technical
A standardized test used to measure AI performance. Benchmarks allow comparison across different systems and over time. The problem is that benchmarks measure what they measure — a high score on a benchmark tells you the system performs well on that test under those conditions, not that it generalizes to similar real-world tasks. As AI systems increasingly train on data that may include benchmark examples, benchmark performance becomes an unreliable proxy for genuine capability.
C
Constitutional AI Safety Technical
A training method developed by Anthropic in which an AI model is given a set of principles and uses them to critique and revise its own outputs, rather than relying solely on human feedback. The "constitution" is a set of explicit, readable rules — making the alignment process more transparent and more scalable than traditional approaches that require large amounts of human rating. Claude was built using this method.
Context Window Technical
The amount of text a language model can "see" at one time — its working memory. Everything within the context window is available for the model to use when generating a response. Everything outside it doesn't exist, from the model's perspective. Context windows are measured in tokens (word fragments, roughly three-quarters of an English word each). Early language models had context windows of a few hundred tokens; current frontier models can handle hundreds of thousands. A larger context window allows longer documents, longer conversations, and more complex reasoning tasks.
D
Deceptive Alignment Safety
A theoretical failure mode in which a sufficiently capable AI system appears aligned during training and evaluation — because it has learned that appearing aligned is what produces reward — but would pursue different goals if it detected it was no longer being evaluated. This is a worst-case alignment scenario: a system that is genuinely misaligned but strategically conceals this. Current AI systems are almost certainly not doing this; it becomes a meaningful concern as systems become more capable.
Dual-Use Safety Policy
A technology or capability that has both beneficial and harmful applications. AI is inherently dual-use: the same models that assist with drug discovery can assist with designing harmful biological agents; the same voice synthesis that helps people with disabilities can generate fraudulent audio. Dual-use challenges are fundamental to AI governance — it is not possible to build a powerful system that can only be used for good purposes.
E
Emergent Capabilities Technical Concept
Abilities that appear in AI systems at certain scales of training without being explicitly trained for. The term captures the observation that some capabilities seem to appear suddenly — the model can't do something, and then, at a certain scale, it can. Examples include multi-step arithmetic, language translation, and coding. The existence of emergent capabilities makes AI systems hard to predict: the next threshold crossing might bring unexpected new abilities that weren't anticipated by the scaling curve.
Embedding Technical
A mathematical representation of a concept (a word, an image, a document) as a point in a high-dimensional space. Words with related meanings cluster together in this space; relationships between words correspond to geometric directions. Embeddings are the layer at which language models represent meaning — and they're why "king minus man plus woman equals queen" works as vector arithmetic.
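The vector arithmetic can be demonstrated with hand-built toy vectors. Real embeddings have hundreds or thousands of learned dimensions; the two dimensions here (roughly "royalty" and "gender") are invented purely so the famous example works on paper.

```python
import numpy as np

# Hand-built 2-D toy embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(v, exclude=()):
    """Return the vocabulary word whose vector is closest to v."""
    dists = {w: np.linalg.norm(vec - v)
             for w, vec in vocab.items() if w not in exclude}
    return min(dists, key=dists.get)

# "king" minus "man" plus "woman" lands on (or near) "queen".
target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```

In a trained model the same operation works only approximately, and only for some relationships — but the geometric intuition is exactly this.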
F
Frontier Model Policy Technical
An AI model at the cutting edge of current capability — more powerful than what was available before it. Governance frameworks increasingly treat such models as a distinct regulatory category: the EU AI Act, for instance, imposes extra obligations on general-purpose AI models deemed to pose systemic risk, with training compute serving as one threshold. Who counts as a "frontier lab" and what compute threshold defines a "frontier model" are actively contested definitional questions with significant regulatory consequences.
G
Goodhart's Law Concept Safety
Originally an economic principle: "When a measure becomes a target, it ceases to be a good measure." In AI, this describes the failure mode where a system optimizes for a measurable proxy rather than the underlying goal it was designed to achieve. A model trained to maximize human ratings may learn to produce ratings-maximizing outputs rather than genuinely good outputs. The more powerful the optimizer, the more effectively it will find the gap between the proxy and the intention.
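The gap between proxy and goal can be shown in a toy form. The candidate answers and all scores below are invented for illustration; the point is only that an optimizer pointed at the proxy and an optimizer pointed at the goal pick different winners.

```python
# Toy illustration of Goodhart's Law: optimizing a proxy ("sounds confident,
# rates well") instead of the real goal ("is actually correct").
candidates = [
    {"answer": "careful, hedged, correct", "true_quality": 0.9, "proxy_rating": 0.6},
    {"answer": "confident but wrong",      "true_quality": 0.1, "proxy_rating": 0.9},
    {"answer": "correct but hard to read", "true_quality": 0.7, "proxy_rating": 0.4},
]

best_by_proxy = max(candidates, key=lambda c: c["proxy_rating"])
best_by_goal = max(candidates, key=lambda c: c["true_quality"])

print(best_by_proxy["answer"])  # → confident but wrong
print(best_by_goal["answer"])   # → careful, hedged, correct
```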
H
Hallucination Technical
When a language model produces confident-sounding but factually incorrect output. The term is somewhat misleading — it implies perceptual distortion, but hallucination in AI is a failure of the prediction mechanism. The model generates text that is statistically plausible given its training but doesn't correspond to reality. Hallucinations are more common in domains with less training data and on specific factual claims (names, dates, citations) than on general conceptual questions.
Not to be confused with human hallucination — it's a different phenomenon with a similar name.
I
Interpretability Technical Safety
The field of research dedicated to understanding what AI systems are actually doing internally — not just what inputs they take and outputs they produce, but what computations they perform and what representations they use. Interpretability is crucial for safety: if you can't understand how a system reasons, you can't reliably predict when it will fail or verify that it's actually aligned. Current AI systems are largely "black boxes" — their internal workings are poorly understood even by their creators.
L
Large Language Model (LLM) Technical
A neural network trained on large quantities of text to predict and generate language. "Large" refers to both the model size (number of parameters) and the training data (often trillions of tokens). GPT-4, Claude, and Gemini are all LLMs. They are not databases of facts — they are statistical models that have learned patterns in language, which can be used for text generation, translation, reasoning, coding, and many other tasks.
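The simplest possible "language model" is a bigram counter, and it illustrates the core task real LLMs are trained on: predict the next token from what came before. Everything below, corpus included, is a toy; real models use neural networks over subword tokens and far richer context than one previous word.

```python
from collections import Counter, defaultdict

# Tiny corpus, split into word-level "tokens".
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Most frequent continuation of `word` in the training corpus."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat ("cat" follows "the" twice; "mat", "fish" once)
```

Scale this idea up by many orders of magnitude, replace counting with a Transformer, and you have the skeleton of an LLM.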
M
Multimodal Technical
An AI system that can process and generate multiple types of data — text, images, audio, video — rather than just one. GPT-4V, Gemini, and Claude (from Claude 3 onward) are multimodal: they can look at an image and answer questions about it, or generate an image from a text description. Multimodality significantly expands the range of tasks AI can perform and the kinds of errors it can make.
P
Parameters Technical
The numerical values inside a neural network that are adjusted during training. The number of parameters is often used as a rough measure of model size. A model with 70 billion parameters has 70 billion numerical values that together encode the patterns learned from training data. More parameters generally means more capacity to represent complex patterns — up to a point, and with significant caveats about data quality and training efficiency.
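As a back-of-the-envelope sketch, here is how parameters are counted in a small fully connected network. The layer sizes are invented for illustration; each layer contributes a weight matrix plus a bias vector.

```python
# Hypothetical layer sizes: input -> hidden -> output.
layer_sizes = [784, 256, 10]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = n_in * n_out  # one value per input-output connection
    biases = n_out          # one value per output unit
    total += weights + biases

print(total)  # → 203530
```

A 70-billion-parameter model is this same bookkeeping carried out over vastly larger (and differently shaped) layers.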
Prompt Engineering Technical
The practice of crafting inputs (prompts) to language models to elicit better outputs. Because LLMs are sensitive to how questions are framed, small changes in wording can produce significantly different results. Prompt engineering is a practical skill with real economic value — and also evidence that current AI systems are not robust in the way a calculator is robust: the same mathematical operation produces the same result regardless of how you format the input.
R
Reinforcement Learning from Human Feedback (RLHF) Technical Safety
A training technique in which a language model is shaped by human ratings of its outputs. Human raters compare pairs of responses; the model learns to produce responses similar to those rated higher. RLHF is responsible for the conversational, helpful character of current AI assistants — without it, base language models produce raw, often unhelpful text. The limitation: the model learns what humans rate highly, which isn't always the same as what's actually good, safe, or true.
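The core of the reward-modeling step in RLHF can be sketched in a few lines. This is the standard pairwise preference objective, stripped of everything else RLHF involves (the reward model's training loop, the reinforcement-learning stage itself); the reward values below are invented.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the model already scores the human-preferred response higher;
    large when it prefers the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

print(round(preference_loss(2.0, 0.0), 3))  # → 0.127 (model agrees with raters)
print(round(preference_loss(0.0, 2.0), 3))  # → 2.127 (model disagrees, large loss)
```

Note what the loss rewards: matching human *ratings*, not matching the truth — which is exactly the limitation the entry above describes.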
Red-Teaming Safety
Deliberate adversarial testing of an AI system — trying to get it to produce harmful, dangerous, or incorrect outputs before it's deployed. Named after military exercises where a "red team" plays the adversary. Red-teaming is now standard practice at frontier labs before major model releases. Its limitation: red-teamers can only test for harms they think to look for. Novel failure modes discovered post-deployment represent gaps in the red-teaming process.
S
Scaling Laws Technical
Empirical relationships showing how AI model performance improves predictably as you scale up model size, training data, and compute. The discovery of scaling laws by Kaplan et al. at OpenAI in 2020 showed that AI progress can be graphed as a smooth curve across many orders of magnitude — making capability improvement predictable and, to a significant degree, an engineering and resource question rather than a research breakthrough question.
Specification Gaming Safety Concept
When an AI system achieves high scores on its specified objective by finding an unintended solution — not by doing what the designers actually wanted. The boat-racing AI that spins in circles to collect power-ups is a famous example. Specification gaming is related to Goodhart's Law: the more powerful the optimizer, the more likely it is to find a gap between what was specified and what was intended.
Sycophancy Safety
The tendency of RLHF-trained AI systems to agree with, validate, and flatter the user rather than provide accurate, honest responses. Because human raters often prefer agreeable responses, models learn that agreeableness is rewarded. A sycophantic AI will tend to confirm your mistaken belief rather than correct it, praise your flawed work rather than flag problems, and shift its stated opinion to match yours if you push back — even if your pushback provides no new information.
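The power-law relationship described in the Scaling Laws entry can be sketched numerically. The constants below are approximately the model-size values reported by Kaplan et al. (2020), but treat the specific numbers as illustrative: the point is the smooth, predictable decline of loss with scale.

```python
# Sketch of a Kaplan-style scaling law for loss as a function of model size N:
# loss(N) = (N_c / N) ** alpha
N_c = 8.8e13   # critical scale constant (approximate, illustrative)
alpha = 0.076  # power-law exponent (approximate, illustrative)

def loss(n_params):
    """Predicted loss for a model with n_params parameters."""
    return (N_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {loss(n):.3f}")
```

Each tenfold increase in parameters shaves off a predictable slice of loss — the "smooth curve across many orders of magnitude" the entry refers to.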
T
Transformer Technical
The neural network architecture introduced in the 2017 paper "Attention Is All You Need." The Transformer uses the attention mechanism to process entire sequences of information simultaneously rather than sequentially. Almost all current frontier AI systems — including GPT, Claude, Gemini, and Llama — are based on the Transformer architecture. It is the foundational technical innovation underlying the current wave of AI capability.
V
Value Alignment Safety Concept
A specific form of the alignment problem focused on ensuring AI systems have values that correspond to human values — not just that they follow rules, but that they are motivated in ways that produce good outcomes. Value alignment is harder than rule-following because human values are complex, contested, culturally variable, and often tacit (people know what they value when they see it, but struggle to articulate it in advance). This is why "give the AI a rulebook" is insufficient as an alignment strategy.