AI Glossary: Data Quality Terms & Definitions

Data-Centric Machine Learning

DCML

Data-centric machine learning focuses on improving model performance by improving the quality and relevance of the training data, rather than only tuning algorithms or model architectures.

Data Cleansing

Data cleansing is the process of identifying and correcting errors or inconsistencies in data sets.

Data Curation

Data curation is the process of managing and maintaining data to ensure its quality, accessibility, and usability.

Data Enrichment

Data enrichment enhances existing data by adding valuable context from external sources.

Data Harmonization

Data harmonization is the process of integrating data from different sources to ensure consistency and usability.

Data Leakage

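Data leakage occurs when information that would not be available at prediction time, such as the target variable or data from the validation or test set, is inadvertently used in model training. It typically produces overly optimistic performance estimates.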
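
A common source of leakage is computing preprocessing statistics over data that includes the test split. The sketch below assumes a simple standardization step and uses hypothetical helper names (`fit_scaler`, `transform`) to contrast a leaky fit with a leak-free one.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if there is no spread
    return mu, sigma

def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [100.0, 200.0]

# Leaky: the statistics are influenced by the test split.
leaky_mu, leaky_sigma = fit_scaler(train + test)

# Leak-free: fit on the training split only, then reuse for the test split.
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)
```

The same principle applies to any fitted preprocessing step: fit on training data, then apply the fitted parameters unchanged to held-out data.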

Data Lineage

Data lineage is the record of how data moves and changes as it passes through systems and processing steps. It supports auditing, debugging, and compliance.

Data Profiling

Data profiling involves analyzing data to understand its structure, quality, and relationships.
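
As a minimal sketch of what a profile might contain, the hypothetical `profile_column` function below summarizes one column, assuming missing values are represented as `None`.

```python
def profile_column(values):
    """Summarize a single column; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

ages = [34, None, 29, 34, 51]
summary = profile_column(ages)
```

Real profiling tools usually add type inference, value distributions, and cross-column relationships on top of counts like these.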

Data Provenance

DP

Data provenance refers to the history and origin of data, detailing its sources and transformations.

Data Quality

Data Quality refers to the accuracy, consistency, and reliability of data used in AI and analytics.

Data Quality Gate

DQG

A data quality gate is a checkpoint in a data pipeline that stops data from moving to the next stage unless it meets predefined quality standards.
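
One possible shape for such a gate is a function that either returns the data unchanged or raises. The sketch below is illustrative only: `quality_gate`, `QualityGateError`, and the 10% missing-value threshold are assumptions, not a standard API.

```python
class QualityGateError(Exception):
    """Raised when a dataset fails the gate."""

def quality_gate(rows, required_fields, max_missing_ratio=0.1):
    """Return rows unchanged if they pass; raise QualityGateError otherwise."""
    if not rows:
        raise QualityGateError("dataset is empty")
    for field in required_fields:
        # Count rows where the field is absent, None, or an empty string.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        ratio = missing / len(rows)
        if ratio > max_missing_ratio:
            raise QualityGateError(
                f"{field}: {ratio:.0%} missing exceeds {max_missing_ratio:.0%}"
            )
    return rows

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
checked = quality_gate(rows, ["id", "email"])
```

Raising an exception, rather than logging a warning, is what makes this a gate: downstream steps never see data that failed the check.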

Data Redundancy

Data redundancy refers to the unnecessary duplication of data within a database or storage system.

Data Scrubbing

Data scrubbing, often used as a synonym for data cleansing, is the process of cleaning and validating data to ensure accuracy and quality.

Data Standardization

Data standardization is the process of transforming data into a common format for consistency and accuracy.
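
Dates are a typical case. The sketch below converts a few input formats to ISO 8601 (`YYYY-MM-DD`); the list of accepted formats and the name `standardize_date` are assumptions chosen for illustration.

```python
from datetime import datetime

# Input formats this sketch accepts; extend as needed for real data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def standardize_date(text):
    """Return the date as an ISO 8601 string, or raise ValueError."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {text!r}")

raw = ["2023-12-31", "31/12/2023", " 31.12.2023 "]
standardized = [standardize_date(d) for d in raw]
```

Ambiguous inputs such as `01/02/2023` need an explicit convention (day-first here), which is itself part of the standard being applied.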

Data Validation

Data validation ensures data accuracy and quality through checks and constraints before processing.
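
At the record level, checks and constraints can be as simple as type and range tests. The rules below (an integer `age` between 0 and 130, an `email` containing `@`) and the function name `validate_record` are hypothetical examples, not a recommended schema.

```python
def validate_record(record):
    """Return a list of constraint violations; an empty list means valid."""
    errors = []

    age = record.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        errors.append("age must be an integer")
    elif not 0 <= age <= 130:
        errors.append("age out of range")

    email = record.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("email must contain '@'")

    return errors

good = validate_record({"age": 30, "email": "a@example.com"})
bad = validate_record({"age": -1, "email": "invalid"})
```

Returning all violations at once, instead of stopping at the first, makes it easier to report and fix bad records in bulk.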

Data Veracity

Data veracity refers to the accuracy, reliability, and truthfulness of data used in AI and analytics.

Entity Resolution

ER

Entity Resolution is the process of identifying and merging records that refer to the same real-world entity across datasets.
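
The simplest form of matching groups records by a normalized key. The sketch below assumes records with `name` and `email` fields and uses hypothetical helpers (`match_key`, `resolve`); production systems typically add fuzzy matching and scoring, which this example omits.

```python
from collections import defaultdict

def match_key(record):
    """Normalize case and whitespace so trivially different records match."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)

def resolve(records):
    """Group records that share the same normalized key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return list(groups.values())

records = [
    {"name": "Ada  Lovelace", "email": "ADA@example.com"},
    {"name": "ada lovelace", "email": "ada@example.com "},
    {"name": "Alan Turing", "email": "alan@example.com"},
]
clusters = resolve(records)
```

Each resulting cluster holds the records believed to describe one entity; merging them into a single record is a separate step.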

Gold Standard Dataset

GSD

A gold standard dataset is a collection of data with carefully verified, typically expert-annotated labels, used as the trusted reference for training and evaluating AI models.

Imputation Strategy

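An imputation strategy is a method for filling in missing values in a dataset, for example with the column mean, the median, or a model-based estimate, so that incomplete records can still be used in analysis.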
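
Two of the simplest strategies are mean and median imputation. The sketch below assumes missing values are represented as `None`; the function name `impute` and its `strategy` parameter are illustrative.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

column = [1.0, None, 3.0, 8.0]
by_mean = impute(column, "mean")      # fills with 4.0
by_median = impute(column, "median")  # fills with 3.0
```

The median is less sensitive to outliers than the mean, which is one reason the choice of strategy matters.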

Incomplete Data

Incomplete data refers to missing or unavailable information in datasets used for analysis and AI model training.

Label Noise

LN

Label noise refers to inaccuracies or errors in the labels assigned to data in machine learning tasks.

Label Noise Transition

LNT

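Label noise transition describes how true labels are corrupted into the labels actually observed in a dataset. It is usually expressed as a transition matrix whose entry (i, j) is the probability that an example of true class i is labeled as class j.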
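
In the research literature, label noise is commonly modeled with a transition matrix whose entry (i, j) is the probability that an example with true class i is observed with label j. The sketch below estimates such a matrix from a small sample where both true and observed labels are known; the function name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def estimate_transition_matrix(true_labels, observed_labels, classes):
    """Row i, column j: fraction of true-class-i examples observed as class j."""
    counts = Counter(zip(true_labels, observed_labels))
    matrix = []
    for i in classes:
        row_total = sum(counts[(i, j)] for j in classes)
        matrix.append(
            [counts[(i, j)] / row_total if row_total else 0.0 for j in classes]
        )
    return matrix

true_labels = ["cat", "cat", "cat", "cat", "dog", "dog"]
observed_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
T = estimate_transition_matrix(true_labels, observed_labels, ["cat", "dog"])
```

Here one of four cats is mislabeled as a dog, so the first row is `[0.75, 0.25]`, while dogs are always labeled correctly, giving `[0.0, 1.0]`.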

Lossless Compression Failure

Lossless Compression Failure occurs when data cannot be compressed without losing information.

Missing Data

Missing data refers to the absence of values in a dataset, which can bias analysis and degrade model performance.

Missing Values Imputation

Missing values imputation is a method to fill in gaps in datasets for analysis and modeling.

NaN Value

NaN

NaN (Not a Number) is a special floating-point value, defined by the IEEE 754 standard, that represents undefined or unrepresentable numerical results such as 0/0.
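
NaN has one property that often surprises people: it is not equal to anything, including itself. The short Python sketch below shows why an explicit check such as `math.isnan` is needed when filtering it out.

```python
import math

x = float("nan")

equal_to_itself = (x == x)   # False: NaN never compares equal
detected = math.isnan(x)     # True: the reliable way to test for NaN

values = [1.0, x, 3.0]
clean = [v for v in values if not math.isnan(v)]
```

Because equality checks fail, a filter like `v != float("nan")` keeps every value, NaN included, and does not remove anything.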

Noisy Data

Noisy data refers to inaccurate or irrelevant information that can distort analysis and machine learning models.

Noisy Labels

NL

Noisy labels refer to incorrect or misleading annotations in training data that can hinder machine learning model performance.