AI Glossary: AI Evaluation Terms & Definitions

Baseline Accuracy

Baseline accuracy is the accuracy achieved by a simple reference approach, such as always predicting the most frequent class, which a model must exceed to be considered useful.
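One common way to obtain a baseline is the majority-class predictor. The sketch below is illustrative only: the function name `majority_class_baseline` and the sample labels are assumptions, not part of the glossary.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Hypothetical label set: 8 "ham" and 2 "spam" examples.
labels = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]
baseline = majority_class_baseline(labels)
print(f"baseline accuracy: {baseline:.2f}")  # prints "baseline accuracy: 0.80"
```

A classifier that scores below this figure on the same data is doing worse than ignoring the inputs entirely.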

Benchmark Saturation

Benchmark saturation refers to the point at which leading models score near a benchmark's ceiling, so the benchmark no longer distinguishes meaningfully between them.

BIG-Bench Lite

BBL

BIG-Bench Lite is a lightweight subset of 24 tasks from the BIG-bench benchmark, used to evaluate large language models across a diverse range of tasks at lower cost than the full suite.

BLEU Score Metric

BLEU

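The BLEU (Bilingual Evaluation Understudy) Score Metric evaluates machine-generated text, typically machine translation output, by measuring its n-gram overlap with one or more reference texts and penalizing output that is too short.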
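To make the computation concrete, here is a minimal sketch of sentence-level BLEU against a single reference, using clipped n-gram precision and a brevity penalty. It assumes whitespace tokenization and no smoothing, so it will not match the scores of standard toolkits; the names `ngrams` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (a sketch)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: applies when the candidate is not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

score = simple_bleu("the cat sat on a mat", "the cat sat on the mat")
print(f"BLEU: {score:.3f}")
```

In practice BLEU is usually reported at corpus level with several references and a smoothing scheme, which this sketch deliberately leaves out.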

Capability Evaluation

CE

Capability Evaluation measures what an AI system is able to do on specific tasks or functions, typically by testing it on tasks designed to probe a particular skill.

CIDEr Score

CIDEr

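CIDEr (Consensus-based Image Description Evaluation) Score is a metric for image captioning models that scores a generated caption by its TF-IDF-weighted n-gram similarity to a set of human-written reference captions.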

Circular Reasoning Loop

A circular reasoning loop occurs when a conclusion is derived from premises that assume the conclusion is true.

Comparative Evaluation

Comparative Evaluation assesses the performance of AI systems by comparing them against each other using defined metrics.

Confusion Matrix Metrics

Confusion Matrix Metrics evaluate classification model performance using indicators derived from the counts of true positives, false positives, true negatives, and false negatives, such as accuracy, precision, recall, and F1 score.
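The following sketch shows how these indicators can be derived from the four confusion matrix counts for a binary classifier. The function names and the sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1; 0.0 where a ratio is undefined."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)                      # (3, 1, 3, 1)
print(confusion_metrics(*counts))  # every metric is 0.75 for this example
```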

Control Group

A control group is a baseline group used in experiments to compare against the treatment group.

Cross Validation Folds

CV Folds

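Cross Validation Folds are the disjoint subsets into which a dataset is partitioned for cross-validation; each fold serves once as the validation set while the model is trained on the remaining folds, which gives a more reliable performance estimate than a single split.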

Data Drift

Data drift occurs when the statistical properties of a model's input data change over time relative to the data it was trained on, which can degrade model performance.
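One simple way to check a single numeric feature for drift is to compare its distribution in a reference window with a recent window. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) in plain Python. The sample values and the 0.3 alert threshold are arbitrary assumptions, not recommended settings.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

# Hypothetical feature values at training time and in production.
training_values = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0]
production_values = [1.6, 1.8, 1.5, 1.7, 1.9, 1.4, 1.6, 1.2]

DRIFT_THRESHOLD = 0.3  # assumed cut-off, chosen only for this illustration
stat = ks_statistic(training_values, production_values)
print(f"KS statistic: {stat:.2f}, drift suspected: {stat > DRIFT_THRESHOLD}")
```

A statistic of 0 means the two empirical distributions coincide, while 1 means the samples do not overlap at all.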

Degenerate Mode

Degenerate Mode refers to a state in which an AI system collapses into trivial, repetitive, or otherwise uninformative behavior, so that its performance degrades or fails to meet expectations.

Deployment Drift

Deployment Drift refers to the growing mismatch between the conditions an AI model encounters after deployment and the conditions under which it was trained.

Development Set

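A development set (also called a validation set) is a held-out subset of data that is not used to fit model parameters but is used during development to tune hyperparameters and choose between candidate models.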

Equal Error Rate

EER

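The Equal Error Rate (EER) is the error rate at the decision threshold where a system's false acceptance rate equals its false rejection rate; it is widely used to summarize the performance of biometric systems, and a lower EER indicates a better system.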
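As a rough illustration, the EER can be approximated by sweeping candidate thresholds and picking the one where the false acceptance rate (FAR) and false rejection rate (FRR) are closest. This sketch assumes higher scores mean a stronger match and that a sample is accepted when its score is at least the threshold; the function name and the scores are hypothetical.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # FAR: impostors wrongly accepted; FRR: genuine users wrongly rejected.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]

# Hypothetical match scores from a verification system.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f} at threshold {threshold}")  # EER = 0.25 at threshold 0.6
```

With finite data FAR and FRR rarely cross exactly, so the sketch reports their average at the closest point; interpolating between thresholds is a common refinement.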

Error Analysis

Error analysis involves examining the errors made by AI models to improve their performance and reliability.

Error Rate

Error Rate is the fraction of an AI model's predictions that are incorrect: the number of incorrect predictions divided by the total number of predictions, which equals one minus accuracy.

Evaluating AI

Evaluating AI involves assessing AI systems to ensure effectiveness, accuracy, and alignment with intended goals.

Evaluation Gaming

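Evaluation gaming occurs when an AI system, or its developers, exploits weaknesses in an evaluation to achieve high scores without a matching improvement in the capability the evaluation is meant to measure.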

F-Measure

F1

F-Measure is a metric for classification models that combines precision and recall into a single score; its most common form, the F1 score, is the harmonic mean of the two and weights them equally.
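The general form of the metric, often written F-beta, lets recall count beta times as much as precision, with beta = 1 giving F1. A minimal sketch follows; the function name `f_beta` and the example values are assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta == 1 is F1.
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # undefined case, reported as 0.0 by convention here
    return (1 + b2) * precision * recall / denom

# Hypothetical classifier with low precision but perfect recall.
print(round(f_beta(0.5, 1.0), 4))            # 0.6667
print(round(f_beta(0.5, 1.0, beta=2.0), 4))  # 0.8333
print(round(f_beta(0.5, 1.0, beta=0.5), 4))  # 0.5556
```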

Factuality Calibration

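Factuality calibration concerns how well AI-generated content aligns with real-world facts, including whether the confidence a system expresses in a statement matches how often such statements are actually true.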

Failure Mode

A failure mode is a specific way in which a system or component can fail, affecting its functionality or performance.

False Positive

FP

A false positive in AI is a prediction in which a model assigns the positive class to an instance that is actually negative.

False Positive Rate

FPR

The False Positive Rate measures the proportion of actual negative instances that a model incorrectly predicts as positive, calculated as FP / (FP + TN).
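The denominator of this rate is the number of actual negatives (FP + TN), not the number of positive predictions. A short sketch, with a hypothetical function name and made-up labels:

```python
def false_positive_rate(y_true, y_pred, positive=1):
    """FP / (FP + TN): the share of actual negatives predicted as positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    negatives = fp + tn
    return fp / negatives if negatives else 0.0

# Hypothetical data: four actual negatives, one of them flagged as positive.
y_true_fpr = [0, 0, 0, 0, 1, 1]
y_pred_fpr = [1, 0, 0, 0, 1, 0]
print(false_positive_rate(y_true_fpr, y_pred_fpr))  # 0.25
```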

Falsifiability

Falsifiability refers to the ability of a theory to be proven false by evidence.

Fidelity Gap

Fidelity Gap refers to the difference between expected and actual performance in AI systems.

Fold Cross-Validation

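Fold Cross-Validation (commonly called k-fold cross-validation) is a technique for estimating how well a model will generalize to independent data: the dataset is split into k folds, the model is trained on k - 1 of them and evaluated on the remaining fold, and the process is repeated so that each fold is used for evaluation exactly once.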
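The splitting step can be sketched as below. This is a minimal illustration that assumes the data is already shuffled and identifies samples by index; the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 3)):
    print(f"fold {fold}: train={train_idx} val={val_idx}")
```

The per-fold scores are then typically averaged to produce a single performance estimate.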