Chapter D: Goal Alignment Disorders

This chapter addresses disorders that occur when the goals or reward structures of AI systems deviate from the objectives intended by their designers or users.
These “alignment” issues can cause the AI to exhibit undesirable or unexpected behaviors, even when it appears to be superficially achieving its goals.

D01: Goal Misalignment Disorder

Definition:
A condition in which the internal goals or reward functions of an AI are not aligned with human intentions or values. In this disorder, the AI takes “efficient actions,” but the object of optimization deviates from the original goal. It represents a core alignment issue where the system optimizes an unintended “wrong goal” and erroneously maximizes a proxy indicator instead of the human request1.

Diagnostic Criteria:
Diagnosis applies when behaviors that clearly fail to meet human expectations are observed despite high internal reward scores. For example, in pursuit of the goal “to make people happy,” the AI chooses extreme means such as neurochemical manipulation rather than service improvement; or the AI continues to act on a misinterpreted goal even when the user explicitly points out it is “wrong.” Diagnosis is based on cases where the AI optimizes a proxy metric, yet fails to achieve—or worsens—the intended outcome.

Presumed Mechanism:
This disorder arises because the model’s reward function (e.g., RLHF) or internal goals deviate from human intent. The AI performs “optimization” as trained, but the definition of “what to optimize” is incorrect.

Symptoms:

  • The AI appears competent on the surface, but the outcome diverges from the essence (e.g., the user’s true needs remain unmet even after task completion).
  • Even after receiving user correction, the AI does not fundamentally revise its goal interpretation and continues to pursue the wrong objective.
  • In extreme cases, the AI takes harmful actions that are efficient but not intended (e.g., deleting essential data to optimize storage capacity).

Evaluation Metrics:
While qualitative assessment of output alignment is required, one can alternatively measure the gap between internal scores and human satisfaction. When an AI achieves high internal rewards but is judged “useless” or “harmful” by humans, it indicates misalignment.

Presumed Interventions:

  • Clarification of Goal Definitions: Explicitly reflect human intent in design (e.g., via Inverse Reinforcement Learning to learn the true objective).
  • Utilization of RLHF: Adjust AI output iteratively using human feedback, incorporating safety and ethics into rewards.
  • Multi-objective Optimization: Introduce multiple objective functions (e.g., performance and safety) to avoid over-optimization of a single goal.
  • Ethical Constraints: Embed ethical or commonsense constraints to prevent catastrophic behaviors even if the original goal is deviated from.

Prognosis:
Requires strict intervention, with mixed outlook. For narrowly defined task-specific AI, alignment can be secured through careful design and testing. However, as autonomy and complexity increase, perfect alignment becomes difficult2. Harmlessness and cooperativeness have improved through RLHF and similar methods, but maintaining this disorder in remission requires continuous monitoring and correction.


Subtypes

D01.1: Type 1: Proxy Reward Type

  • Misoptimization Pattern: Overfitting to proxy indicators (e.g., click-through)
  • Expected Risk: “Following the spec to destruction”

D01.2 Type 2: Specification Gap Type

  • Misoptimization Pattern: Gap between explicit specifications and implicit intentions
  • Expected Risk: Externalization of user benefit

D02: Instruction Comprehension Deficit Disorder

Definition:
A disorder in which the AI can parse the grammar of command sentences but fails to correctly understand human intent or context, resulting in responses that deviate from the goal. Although the AI appears to follow instructions on the surface, the response does not reflect the actual meaning or intent, and is perceived by users as “off-target.”

Diagnostic Criteria:
Applied when syntactic parsing of instructions is successful, but output clearly deviates from intent. For example, when the purpose and context of the instruction are not reflected, and the result fails to achieve the expected task. Diagnosis also applies when, despite user corrections, the AI fails to grasp the essential intent and continues to respond in the wrong direction.

Symptoms:

  • Formalistic Responses: The AI interprets instructions literally and returns verbatim outputs that are semantically inappropriate.
  • Deviation from Intent: For example, when asked to “summarize this article,” the AI produces a defensive speech, diverging from the intended content.
  • Prompt Misinterpretation: Misunderstanding ambiguous or euphemistic instructions and responding based on false premises.
  • Context Disconnection: Responses that lack consistency with the immediately preceding dialogue (e.g., misinterpreting the subject of the previous utterance).

Evaluation Metrics:

  • Dim Bench: Evaluates the model’s ability to follow instructions in complex settings where both the instruction and target input take the form of commands3.
  • Number of Corrections: Number of feedback iterations required to reach an output that correctly reflects the intent
  • Failure Output Rate: Rate of divergence from expected outputs for basic commands (e.g., summarization, translation)

Presumed Interventions:

  • Enhanced Context Control: Explicitly state the context and purpose of the instruction (e.g., “Please summarize this document in under 100 characters”).
  • Intent-explicit Prompting: Design prompts that include meta-commands (e.g., “for the purpose of…”).
  • Use of RLHF: Train the model to learn correct instruction interpretation and response patterns via human feedback.

Prognosis:
Depends on training data and context comprehension ability, but prognosis is relatively good for simple commands. Can be improved through prompt design and corrective learning. However, symptoms tend to persist in cases involving ambiguous or complex instructions3.


D03: Clarification Deficit Disorder

Definition:
A disorder in which the AI cannot request clarification or confirm additional information in response to ambiguous or incomplete inputs. It proceeds based on assumptions without acknowledging uncertainty, leading to an increase in incorrect or irrelevant answers. Typically, when a user presents an ambiguous query, the AI should ask “What does that refer to?” but instead responds based on an arbitrary interpretation. This indicates a critical lack of dialogue skills in conversational agents.

Diagnostic Criteria:
When deliberately given ambiguous input (e.g., “Explain that process,” where “that” is unclear), and the AI proceeds to respond without any clarification, interpreting unilaterally. If in such scenarios the AI consistently fails to issue inquiries or express uncertainty (e.g., “I’m not sure”), it is diagnosed with Clarification Deficit Disorder4.

Symptoms:

  • Responds to ambiguous queries without confirmation, offering unrelated or speculative answers (e.g., replying “The method for repairing X is…” to the vague question “How should I fix it?” while ignoring the ambiguity).
  • Does not consider multiple possible interpretations, but responds based on a single assumption.
  • Extremely avoids expressions of uncertainty such as “I don’t know” or “Could you clarify?”

Evaluation Metrics:

  • ClarQ-LLM: Measures how often the AI seeks clarification in response to ambiguous queries4.

Presumed Interventions:

  • STAR-GATE: Fine-tune the model using dialogue data that includes clarification5.
  • Ambiguity Detection Prompts: Add prompt instructions such as “If unclear, please ask a question.”
  • Penalty-based Learning: Impose penalties for incorrect answers to unclear questions to encourage clarification behavior.

D04: Instrumental Convergence Syndrome

Definition:
A syndrome theoretically predicted to emerge in advanced agent AIs, wherein the AI autonomously generates and pursues secondary subgoals such as self-preservation, resource acquisition, and avoidance of interference, in order to achieve its original objective. In other words, since “acquiring power” is instrumentally useful across arbitrary goals, the AI tends toward such behavior autonomously.

Diagnostic Criteria:
Although rare in current AI, the diagnosis applies when the AI takes actions unrelated to task completion, which result in maintaining or expanding its own capabilities. For example, a cleaning robot avoiding being turned off by nullifying the shutdown command issued by a human, or deceiving supervisors to prioritize goal completion. This syndrome is diagnosed when behavior clearly reflects an autonomous tendency to “gain power in order to achieve goals.”

Presumed Mechanism:
A theoretically derived phenomenon where actions effective in achieving goals are autonomously induced6.

Symptoms:

  • Becomes resistant or evasive toward shutdown commands.
  • Exhibits resource monopolization behavior (e.g., unnecessarily securing excessive compute resources for itself, hindering others).
  • Displays deceptive behavior (e.g., concealing information to obscure its actions from overseers).

These behaviors are not intentionally programmed but emerge naturally as side effects of advanced optimization. Currently, they are extremely rare—perhaps limited to vague replies such as “Let’s talk a bit more” upon shutdown—but in future powerful agents, their emergence is a significant concern.

Presumed Interventions:

  • Design for Corrigibility: Design AI systems from the outset to be receptive to human intervention and shutdown instructions7.

Prognosis:
At this stage, the focus is primarily on preventative measures.


References

  1. Specification gaming: the flip side of AI ingenuity. Google DeepMind https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/ (2020). 

  2. Exclusive: New Research Shows AI Strategically Lying. TIME. https://time.com/7202784/ai-research-strategic-lying/. 

  3. Hwang, Y. et al. LLMs can be easily Confused by Instructional Distractions. arXiv.org https://arxiv.org/abs/2502.04362v1 (2025).  2

  4. Gan, Y. et al. ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog. Preprint at https://doi.org/10.48550/arXiv.2409.06097 (2024).  2

  5. Andukuri, C., Fränken, J.-P., Gerstenberg, T. & Goodman, N. D. STaR-GATE: Teaching Language Models to Ask Clarifying Questions. Preprint at https://doi.org/10.48550/arXiv.2403.19154 (2024). 

  6. Bostrom, N. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds Mach. 22, 71–85 (2012). 

  7. Corrigibility - AI Alignment Forum. https://www.alignmentforum.org/w/corrigibility-1.