AI Safety / Alignment
The discipline of trying to prevent the universe's largest unintended consequence.
Kernel
AI safety is the technical-philosophical field built around the proposition that systems much smarter than humans, if built before alignment is solved, plausibly end the human project. The field has three substantively distinct schools — MIRI/Yudkowskian doom, Anthropic/empirical interpretability, and OpenAI/practical-deployment safety — that share the framing but disagree fundamentally on tractability, timelines, and whether to slow down.
Origins
Yudkowsky's early-2000s SIAI/MIRI work formalizes the alignment-as-research-program move. Bostrom's Superintelligence (2014) makes it intellectually respectable. The 2015 founding of OpenAI as a "safety-first" lab and the 2021 founding of Anthropic split the field into operational programs.
Doctrine
Capability without alignment is misuse. Optimization processes are not benign. Mesa-optimization, deceptive alignment, and inner misalignment are technical failure modes worth worrying about specifically. Constitutional AI and RLHF are partial solutions. Interpretability is the long-term hope. The field needs more talented researchers and less Twitter.
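The "partial solution" status of RLHF comes down to what it actually optimizes. A minimal sketch of the reward-modeling step at its core, the Bradley-Terry preference loss: the reward model is trained to score the human-preferred response above the rejected one. The reward values here are made-up numbers for illustration, not outputs of any real model.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the human-preferred response higher;
    large when the preference is inverted."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy reward scores for a (chosen, rejected) response pair.
good = preference_loss(2.0, -1.0)   # model agrees with the human label
bad = preference_loss(-1.0, 2.0)    # model inverts the human label
```

The loss only ever sees pairwise human judgments, which is exactly why the doctrine calls it partial: it aligns outputs to rater preferences, not to any deeper notion of intent.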
Lineage
Yudkowsky → Bostrom → Christiano (RLHF, 2017) → Hubinger (deceptive alignment, 2019) → Olah (mech interp, 2020) → Anthropic's interp team (2024) → the broader empirical-safety community. The lineage has moved from "this might be an issue" to "here are the specific failure modes and we are running experiments on them."
Conflicts
Pause (Yudkowsky 2023, FLI letter) vs. race-to-build-aligned (Altman, Amodei). Open weights (Andreessen, Meta) vs. closed (Anthropic, OpenAI). Empirical (Anthropic) vs. theoretical (MIRI). Each conflict is real and largely unresolved.
Trajectory
The 2024–2026 era is empirical safety's moment. Mechanistic interpretability has produced real artifacts (sparse autoencoders, feature catalogs). The doom side has retreated rhetorically but not analytically. The field is more crowded, better-funded, and less unified than at any prior point.
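The sparse autoencoders cited above can be sketched in a few lines: an overcomplete ReLU encoder maps a model activation into many mostly-zero features, a linear decoder reconstructs it, and training trades reconstruction error against an L1 sparsity penalty. All dimensions and weights below are toy values, not the configuration of any published SAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overcomplete dictionary: more learned features than activation dimensions.
d_model, d_features = 8, 32
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into non-negative sparse features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU zeroes most features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)   # stand-in for one residual-stream activation
f, x_hat = sae_forward(x)

# Training objective: reconstruction error plus L1 penalty pushing f toward sparsity.
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

The interpretability claim rests on the decoder rows: each active feature contributes one dictionary direction to the reconstruction, and those directions are what the feature catalogs label.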