Training Dataset
The curated collection of labeled or unlabeled data used to teach an AI model the relationships and patterns it must learn to perform its task.
Definition
A governance-managed asset that undergoes quality checks (accuracy, completeness, representativeness), privacy reviews (such as scrubbing personally identifiable information, PII), and bias assessments before use. Training datasets are versioned, cataloged with metadata (source, timestamp, steward), and stored securely. Governance policies require that every dataset update trigger revalidation and that dataset lineage remain traceable, which supports reproducibility and compliance.
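The versioning, cataloging, and revalidation rules described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `DatasetVersion`, `register_version`, and `needs_revalidation` are hypothetical, and hashing a sorted list of string records is a simplifying assumption standing in for whatever fingerprinting a real catalog uses.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def fingerprint(records: list[str]) -> str:
    """Return a SHA-256 digest of the records, independent of their order."""
    digest = hashlib.sha256()
    for record in sorted(records):
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # simple separator; assumes records contain no newlines
    return digest.hexdigest()


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical catalog entry carrying the metadata named in the definition."""
    name: str
    version: str
    source: str
    steward: str
    created_at: datetime
    content_hash: str


def register_version(name: str, version: str, source: str,
                     steward: str, records: list[str]) -> DatasetVersion:
    """Create a catalog entry with a timestamp and a content fingerprint."""
    return DatasetVersion(
        name=name,
        version=version,
        source=source,
        steward=steward,
        created_at=datetime.now(timezone.utc),
        content_hash=fingerprint(records),
    )


def needs_revalidation(last_validated: DatasetVersion, records: list[str]) -> bool:
    """An update triggers revalidation when the content no longer matches."""
    return fingerprint(records) != last_validated.content_hash


v1 = register_version("road-scenes", "1.0.0", "fleet-cameras",
                      "data-steward@example.com", ["frame-001", "frame-002"])
print(needs_revalidation(v1, ["frame-002", "frame-001"]))               # same content
print(needs_revalidation(v1, ["frame-001", "frame-002", "frame-003"]))  # new record added
```

Because the catalog entry stores a content hash alongside the source, timestamp, and steward, a training run can record the exact version it used, and any change to the underlying records is detectable.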
Real-World Example
A self-driving car team maintains a labeled road-scene dataset collected from diverse geographies. Before adding new data, they run automated checks for labeling consistency, remove or blur PII such as license plates, and update the dataset register. Each trained model references a specific dataset version, so its training inputs can be audited.
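A pre-ingestion gate of the kind this team runs might look something like the sketch below. Everything here is assumed for illustration: the label taxonomy, the annotation fields (`object_id`, `label`, `plates_blurred`), and the `check_batch` function are hypothetical and do not describe any particular team's tooling.

```python
from collections import defaultdict

# Hypothetical label taxonomy for road scenes.
ALLOWED_LABELS = {"car", "pedestrian", "cyclist", "traffic_sign"}


def check_batch(batch: list[dict]) -> list[str]:
    """Return a list of issues found in a batch of annotations; empty means it passes."""
    issues = []
    labels_by_object = defaultdict(set)

    for row in batch:
        # Labeling consistency, part 1: every label must come from the taxonomy.
        if row["label"] not in ALLOWED_LABELS:
            issues.append(f"{row['object_id']}: unknown label {row['label']!r}")
        # Privacy: the upstream blurring step must have marked the row as processed.
        if not row["plates_blurred"]:
            issues.append(f"{row['object_id']}: license plates not blurred")
        labels_by_object[row["object_id"]].add(row["label"])

    # Labeling consistency, part 2: annotators must agree on each object.
    for object_id, labels in sorted(labels_by_object.items()):
        if len(labels) > 1:
            issues.append(f"{object_id}: conflicting labels {sorted(labels)}")

    return issues


new_batch = [
    {"object_id": "obj-1", "label": "car", "plates_blurred": True},
    {"object_id": "obj-2", "label": "pedestrian", "plates_blurred": True},
    {"object_id": "obj-2", "label": "cyclist", "plates_blurred": True},
]
for issue in check_batch(new_batch):
    print(issue)
```

Only a batch that returns no issues would be added to the dataset and recorded in the register under a new version.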