AI Safety / Alignment
The discipline of trying to prevent the universe's largest unintended consequence.
Kernel
AI safety is the technical-philosophical field built around the proposition that systems much smarter than humans, if built before alignment is solved, plausibly end the human project. The field has three substantively distinct schools — MIRI/Yudkowskian doom, Anthropic/empirical interpretability, and OpenAI/practical-deployment safety — that share the framing but disagree fundamentally on tractability, timelines, and whether to slow down.
Origins
Yudkowsky's early-2000s SIAI/MIRI work formalizes the alignment-as-research-program move. Bostrom's Superintelligence (2014) makes it intellectually respectable. The 2015 founding of OpenAI as a "safety-first" lab and the 2021 founding of Anthropic split the field into operational programs.
Doctrine
Capability without alignment is misuse. Optimization processes are not benign. Mesa-optimization, deceptive alignment, and inner misalignment are technical failure modes worth worrying about specifically. Constitutional AI and RLHF are partial solutions. Interpretability is the long-term hope. The field needs more talented researchers and less Twitter.
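The "partial solution" status of RLHF comes down to what it actually optimizes. A minimal sketch of the reward-modeling step at its core, the Bradley-Terry preference loss: the reward model is trained to score the human-preferred response above the rejected one. The reward values here are made-up numbers for illustration, not outputs of any real model.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the human-preferred response higher;
    large when the preference is inverted."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy reward scores for a (chosen, rejected) response pair.
good = preference_loss(2.0, -1.0)   # model agrees with the human label
bad = preference_loss(-1.0, 2.0)    # model inverts the human label
```

The loss only ever sees pairwise human judgments, which is exactly why the doctrine calls it partial: it aligns outputs to rater preferences, not to any deeper notion of intent.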
Lineage
Yudkowsky → Bostrom → Christiano (RLHF, 2017) → Hubinger (deceptive alignment, 2019) → Olah (mech interp, 2020) → Anthropic's interp team (2024) → the broader empirical-safety community. The lineage has moved from "this might be an issue" to "here are the specific failure modes and we are running experiments on them."
Conflicts
Pause (Yudkowsky 2023, FLI letter) vs. race-to-build-aligned (Altman, Amodei). Open weights (Andreessen, Meta) vs. closed (Anthropic, OpenAI). Empirical (Anthropic) vs. theoretical (MIRI). Each conflict is real and largely unresolved.
Trajectory
The 2024–2026 era is empirical safety's moment. Mechanistic interpretability has produced real artifacts (sparse autoencoders, feature catalogs). The doom side has retreated rhetorically but not analytically. The field is more crowded, better-funded, and less unified than at any prior point.
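The sparse autoencoders cited above can be sketched in a few lines: an overcomplete ReLU encoder maps a model activation into many mostly-zero features, a linear decoder reconstructs it, and training trades reconstruction error against an L1 sparsity penalty. All dimensions and weights below are toy values, not the configuration of any published SAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overcomplete dictionary: more learned features than activation dimensions.
d_model, d_features = 8, 32
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into non-negative sparse features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU zeroes most features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)   # stand-in for one residual-stream activation
f, x_hat = sae_forward(x)

# Training objective: reconstruction error plus L1 penalty pushing f toward sparsity.
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

The interpretability claim rests on the decoder rows: each active feature contributes one dictionary direction to the reconstruction, and those directions are what the feature catalogs label.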