AI Glossary: AI Benchmarking Terms & Definitions

Benchmark Saturation

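Benchmark saturation refers to the point at which leading models score at or near a benchmark's maximum, so that further gains are too small to distinguish one model from another and the benchmark no longer yields a useful performance assessment.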

BB

BIG-Bench (Beyond the Imitation Game Benchmark, abbreviated BB) is a collaborative benchmark suite of more than 200 tasks designed to evaluate the performance of large language models across diverse domains.

BB-Hard

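BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks on which earlier language models failed to outperform the average human rater. It is used to evaluate AI models on diverse NLP tasks that require complex, multi-step reasoning.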

CIFAR-100

The CIFAR-100 dataset is a collection of 60,000 32x32 color images in 100 classes (600 images per class), split into 50,000 training images and 10,000 test images, for machine learning research.

FID

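Fréchet Inception Distance (FID) measures the quality of generated images by comparing the distribution of their Inception-network features to that of real images. Lower values indicate that the two distributions are closer.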
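
As a sketch of how the comparison is usually computed: each set of image features is summarized by a Gaussian, and FID is the Fréchet distance between the two Gaussians. The symbols below are notation introduced here for illustration, not taken from this glossary: \mu_r, \Sigma_r are the mean and covariance of the real-image features, and \mu_g, \Sigma_g are those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term penalizes a shift between the feature means and the second a mismatch between the covariances. When the two Gaussians are identical, both terms vanish and FID is 0.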

HE

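HumanEval (HE) is a benchmark of 164 hand-written Python programming problems for evaluating code-generation models. A generated solution counts as correct only if it passes the problem's unit tests.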
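
HumanEval results are commonly reported as pass@k, the probability that at least one of k sampled solutions passes the unit tests. The following is a minimal sketch of the standard unbiased estimator, 1 - C(n - c, k) / C(n, k). The function name `pass_at_k` and its parameters are chosen here for illustration: `n` is the number of samples generated for a problem, `c` the number that passed, and `k` the sampling budget being scored.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random draw of k samples (without replacement) contains no
    passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 whenever
    # every possible draw of k samples must include a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 3, 1)` evaluates 1 - 7/10, which is the plain fraction of passing samples. A benchmark-level score is then the mean of the per-problem estimates.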

Overall Score

The Overall Score is a composite metric that combines an AI model's results across multiple evaluation criteria into a single number.
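
How a composite score is formed depends on the leaderboard or report that defines it. As one plausible illustration, and not a formula stated in this glossary, the sketch below assumes a weighted average of per-criterion scores that share a common scale. The function name `overall_score`, the criterion names, and the weights are all hypothetical.

```python
from typing import Dict, Optional


def overall_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-criterion scores into one weighted average.

    Assumption: every score is on the same scale (for example 0-100).
    If no weights are given, all criteria count equally.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight
```

With hypothetical inputs, `overall_score({"reasoning": 80.0, "coding": 60.0})` averages the two criteria equally, while passing `weights={"reasoning": 3.0, "coding": 1.0}` shifts the result toward the reasoning score.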