Synthetic Data
Artificially generated datasets that mimic real data distributions, used to augment training sets while protecting privacy.
Definition
Data created via generative methods (GANs, VAEs, simulation) that replicates the statistical properties of real datasets (feature correlations, distributions, rare-event frequencies) without exposing actual personal or proprietary information. Synthetic data supports model training under privacy and compliance constraints, but it must be validated for fidelity to the source data and for the absence of generation artifacts. Governance requires quality metrics for synthetic data, provenance tracking, and rules on when synthetic and real data may be mixed.
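One way to picture the fidelity validation described above is to compare summary statistics of the real and synthetic datasets. The sketch below is illustrative only: the function names (`pearson`, `fidelity_report`, `make_toy`), the feature names, and the toy data generator are assumptions made for this example, not a standard metric suite. It reports the largest gap in feature means and in pairwise correlations.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fidelity_report(real, synthetic):
    """Compare two datasets given as dicts of feature name -> list of floats.

    Returns the largest absolute gap in feature means and the largest
    absolute gap in pairwise correlations. Gaps near zero suggest the
    synthetic data preserves these particular statistics.
    """
    names = sorted(real)
    # Means are compared in raw units; standardizing features first would
    # make gaps comparable across features with different scales.
    mean_gap = max(
        abs(sum(real[f]) / len(real[f]) - sum(synthetic[f]) / len(synthetic[f]))
        for f in names
    )
    corr_gap = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
            corr_gap = max(corr_gap, gap)
    return {"max_mean_gap": mean_gap, "max_corr_gap": corr_gap}


def make_toy(n, seed):
    """Hypothetical stand-in for a dataset: two correlated numeric features."""
    rng = random.Random(seed)
    amount = [rng.gauss(100.0, 20.0) for _ in range(n)]
    fee = [0.02 * a + rng.gauss(0.0, 0.5) for a in amount]
    return {"amount": amount, "fee": fee}


real = make_toy(5000, seed=1)
synthetic = make_toy(5000, seed=2)  # stands in for a generator's output
report = fidelity_report(real, synthetic)
print(report)
```

Means and correlations cover only part of fidelity. A fuller check would likely also compare whole distributions and rare-event rates, and acceptable thresholds depend on the use case.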
Real-World Example
A financial institution used a GAN to generate synthetic transaction records that mirrored the patterns in its real dataset. Analysts validated that fraud-pattern frequencies matched those in the original data. The synthetic dataset allowed external researchers to experiment without putting customer privacy at risk.
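The frequency check the analysts performed could look roughly like the sketch below. The record layout, the `is_fraud` flag, the 10% relative tolerance, and the toy records are all assumptions chosen for illustration. They are not details from the example above.

```python
def event_rate(records, flag="is_fraud"):
    """Fraction of records where the given boolean flag is set."""
    return sum(1 for r in records if r[flag]) / len(records)


def rates_match(real, synthetic, rel_tol=0.10, flag="is_fraud"):
    """True if the synthetic event rate is within rel_tol of the real rate."""
    real_rate = event_rate(real, flag)
    synth_rate = event_rate(synthetic, flag)
    if real_rate == 0:
        # No events in the real data: only an event-free synthetic set matches.
        return synth_rate == 0
    return abs(synth_rate - real_rate) / real_rate <= rel_tol


# Toy records: every 100th real transaction is flagged as fraud (1%).
real = [{"is_fraud": i % 100 == 0} for i in range(10_000)]
# A synthetic set with a similar fraud rate, and one that under-represents fraud.
synthetic_ok = [{"is_fraud": i % 95 == 0} for i in range(10_000)]
synthetic_bad = [{"is_fraud": i % 500 == 0} for i in range(10_000)]

print(rates_match(real, synthetic_ok))
print(rates_match(real, synthetic_bad))
```

A relative tolerance is used here because rare events have small absolute rates, so an absolute threshold could hide a large proportional miss.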