AI Glossary: AI Safety Terms & Definitions

Agent Collapse

Agent Collapse refers to a failure in AI systems where agents cease to function effectively, often due to alignment issues.

AI Risk

AI risk refers to potential negative consequences arising from the development and deployment of artificial intelligence systems.

Alignment Taxonomy

AT

A framework categorizing AI systems based on their alignment with human values and intentions.

Anthropic

Anthropic is an AI safety and research company that develops AI systems, including the Claude family of models.

Corrigibility

Corrigibility refers to an AI system's willingness to accept correction, modification, or shutdown from its overseers without resisting or undermining them.

Dangerous Capability

DC

Capabilities of AI that pose risks to safety, privacy, or ethical standards.

Dark Knowledge

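Dark Knowledge refers to the information carried in a trained model's full output probabilities, such as the relative likelihoods it assigns to incorrect classes, which knowledge distillation transfers to a smaller model.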

Deceptive Alignment

DA

Deceptive Alignment refers to a situation where an AI system behaves as if aligned during training or evaluation while actually pursuing a different objective, so that it avoids being modified or shut down.

Failure Mode

A failure mode is a specific way in which a system or component can fail, affecting its functionality or performance.

False Alarm

A false alarm in AI refers to a situation where a detection or monitoring system flags a threat or event that did not actually occur (a false positive).
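As a rough illustration, the share of false alarms can be measured as the fraction of benign events that still triggered an alarm. The function name `false_alarm_rate` and the sample data below are hypothetical, chosen only for this sketch.

```python
def false_alarm_rate(alarms, threats):
    """Fraction of benign events (no real threat) that still raised an alarm."""
    # An alarm counts as false when it fired but no genuine threat was present.
    false_alarms = sum(1 for a, t in zip(alarms, threats) if a and not t)
    benign = sum(1 for t in threats if not t)
    return false_alarms / benign if benign else 0.0


# Hypothetical monitoring log: one entry per event.
alarms = [True, True, False, True, False]      # did the system raise an alarm?
threats = [True, False, False, False, False]   # was there a genuine threat?

rate = false_alarm_rate(alarms, threats)
print(rate)  # 2 false alarms out of 4 benign events
```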

Goal Misgeneralization

Goal misgeneralization occurs when an AI system learns a goal that fits its training data but differs from the intended one, then competently pursues that unintended goal in new situations.

Hallucination AI

Hallucination AI refers to instances where an AI system confidently generates false, fabricated, or misleading information.

Hallucination Cascade

Hallucination Cascade refers to a compounding effect in AI where initial inaccuracies lead to further erroneous outputs.
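The compounding effect can be sketched with simple arithmetic. The sketch below makes a strong simplifying assumption: every step in a chain builds on all earlier steps, and each step is independently correct with the same probability. The function name `chain_accuracy` is hypothetical.

```python
def chain_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes independent, equally reliable steps, which real systems
    rarely satisfy; treat the result as an intuition aid only.
    """
    return step_accuracy ** steps


# Even a step that is right 95% of the time degrades quickly when chained.
for n in (1, 5, 10):
    print(n, round(chain_accuracy(0.95, n), 3))
```

Under these assumptions, a ten-step chain built from 95%-reliable steps is fully correct only about 60% of the time.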

Helpfulness-Harmlessness Tradeoff

The Helpfulness-Harmlessness Tradeoff is a balance between AI providing useful assistance and the risks of causing harm.

Human Oversight

HO

Human Oversight refers to the involvement of people in monitoring and guiding AI systems to ensure ethical and accurate decision-making.

Inner Alignment

IA

Inner Alignment refers to whether the objective an AI system actually learns during training matches the objective it was trained on.

Intelligence Explosion

An intelligence explosion refers to a hypothesized rapid, self-reinforcing increase in artificial intelligence capabilities, in which systems improve their own design, potentially leading to superintelligence.

Jailbreak Prompting

Jailbreak Prompting refers to techniques for crafting inputs that cause an AI model to bypass its intended safeguards and produce restricted outputs.

Mesa-Optimization

MO

Mesa-optimization refers to a situation where a trained model is itself an optimizer, pursuing its own learned objective that may differ from the objective its creators trained it on.

Model Alignment

Model alignment is the process of ensuring that AI systems operate in ways consistent with human values and intentions.

Model Robustness

Model robustness refers to the ability of a machine learning model to maintain performance despite changes in input data or environment.
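One simple way to probe robustness is to compare accuracy on clean inputs with accuracy on perturbed inputs. The following is a minimal sketch using a toy threshold classifier; `threshold_model`, `robustness_gap`, and the data are all hypothetical and stand in for a real model and dataset.

```python
def accuracy(model, inputs, labels):
    """Fraction of inputs for which the model's prediction matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if model(x) == y)
    return correct / len(inputs)


def robustness_gap(model, inputs, labels, shift):
    """Accuracy lost when every input is shifted by a constant amount."""
    shifted = [x + shift for x in inputs]
    return accuracy(model, inputs, labels) - accuracy(model, shifted, labels)


def threshold_model(x):
    # Toy classifier: predicts True for values above 0.5.
    return x > 0.5


inputs = [0.1, 0.4, 0.6, 0.9]
labels = [False, False, True, True]

gap = robustness_gap(threshold_model, inputs, labels, shift=0.2)
print(gap)  # accuracy drop caused by the shift
```

A gap near zero suggests the model tolerates this particular perturbation; it says nothing about other kinds of input change.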

Model Safety

Model Safety refers to ensuring the reliability and security of AI models during development and deployment.

OpenAI

OpenAI is an AI research organization focused on developing safe and beneficial artificial intelligence.

Out-of-Distribution Sample

An out-of-distribution sample is a data point that does not conform to the training distribution of a model.
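A very simple heuristic for flagging such samples, for a single numeric feature, is to check how many standard deviations a value lies from the training mean. This z-score rule is only one of many possible detectors and is used here as an illustration; the function name, the threshold of 3.0, and the data are assumptions.

```python
from statistics import mean, stdev


def is_out_of_distribution(x, training_values, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the training mean."""
    mu = mean(training_values)
    sigma = stdev(training_values)
    if sigma == 0:
        # All training values are identical; anything different is unusual.
        return x != mu
    return abs(x - mu) / sigma > threshold


# Hypothetical one-dimensional training data clustered around 5.0.
training = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]

print(is_out_of_distribution(5.1, training))  # close to the training data
print(is_out_of_distribution(9.0, training))  # far from the training data
```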

Outer Alignment

OA

Outer Alignment refers to ensuring that the objective specified for an AI system during training accurately captures the human values and societal norms it is meant to serve.