#1 Anthropic Research Introduces "Introspection Adapters" to Detect Potential LLM Misalignment
Research from Anthropic introduces "introspection adapters" (IAs), a tool that enables language models to self-report the behaviors they have learned, including potential misalignment. The adapters generalize to detecting hidden misalignment, backdoors, and safeguard removal.