High-dimensional Analysis of Synthetic Data Selection

ICLR 2026 Oral

Institute of Science and Technology Austria (ISTA)
[Teaser figure: synthetic data selection via covariance matching]

Abstract

Despite the progress in the development of generative models, their usefulness in creating synthetic data that improves the prediction performance of classifiers has been called into question. Beyond heuristic principles such as "synthetic data should be close to the real data distribution", it remains unclear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets, and generative models used for augmentation.

What matters

Covariance shift affects excess risk; mean shift drops out in mixed training.

What is optimal

Under natural normalization, matching synthetic covariance to real covariance is optimal.

Why it matters

The same principle transfers well to modern feature extractors and deep models.

Theory

What do we prove?

Setup and assumptions

We study linear regression with real training data \((X_t, y_t)\) and synthetic augmentation data \((X_s, y_s)\), both generated by the same underlying parameter \(\beta\).

\(y_{(i)} = X_{(i)}\beta + \varepsilon_{(i)}\), \(\quad X_{(i)} = Z^{(i)}\Sigma_{(i)}^{1/2} + 1_{n_{(i)}}\mu_{(i)}^\top,\quad i\in\{t,s\}\).

Distributions

Real and synthetic data may have different means \(\mu_t,\mu_s\) and covariances \(\Sigma_t,\Sigma_s\), but they share the same conditional label model.

Proportional regime

Dimension and sample sizes grow together: the ratios \(n_t/p\), \(n_s/p\), and \(n/p\), with \(n=n_t+n_s\), remain of constant order.

Regularity

Noise is centered with variance \(\sigma^2\); the entries of \(Z^{(i)}\) have bounded moments; and the spectra of \(\Sigma_t,\Sigma_s\) are bounded away from \(0\) and \(\infty\).

Estimator and risk

We train min-norm least squares on the union of real and synthetic samples, and evaluate excess risk on the real target distribution.
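As a concrete illustration, here is a minimal simulation of this setup (an assumed sketch, not the paper's code: the sizes, seed, and the `sample_linear` helper are illustrative). Even with a large mean shift in the synthetic data, the min-norm estimator trained on the union recovers \(\beta\) well, in line with the theorem below.

```python
# Minimal simulation of the setup: real and synthetic samples share the same
# beta but differ in mean/covariance; we fit min-norm least squares on the union.
# (Illustrative sketch; sizes and helper names are assumptions.)
import numpy as np

def sample_linear(n, p, beta, mu, cov_sqrt, sigma, rng):
    X = rng.standard_normal((n, p)) @ cov_sqrt + mu
    return X, X @ beta + sigma * rng.standard_normal(n)

rng = np.random.default_rng(0)
p, n_t, n_s, sigma = 20, 60, 60, 0.5
beta = rng.standard_normal(p) / np.sqrt(p)

mu_t, Sig_t = np.zeros(p), np.eye(p)        # target distribution
mu_s, Sig_s = 3.0 * np.ones(p), np.eye(p)   # synthetic: large mean shift
Xt, yt = sample_linear(n_t, p, beta, mu_t, np.linalg.cholesky(Sig_t), sigma, rng)
Xs, ys = sample_linear(n_s, p, beta, mu_s, np.linalg.cholesky(Sig_s), sigma, rng)

X, y = np.vstack([Xt, Xs]), np.concatenate([yt, ys])
beta_hat = np.linalg.pinv(X) @ y            # min-norm least squares (n > p)
excess_risk = (beta_hat - beta) @ Sig_t @ (beta_hat - beta)
```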

Deterministic equivalents

Theorem (Under-parameterized, mixed real and synthetic data). Assume \(n>p\), so the min-norm estimator has zero bias. Let \(M=\Sigma_s^{1/2}\Sigma_t^{-1/2}\), and let \(\lambda_1,\dots,\lambda_p\) be the eigenvalues of \(M^\top M\). Then, with high probability,

\[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\frac{\sigma^2}{n}\operatorname{Tr}\!\left[\left(\alpha_1 M^\top M+\alpha_2 I_p\right)^{-1}\right]\right|=0, \] \[ \alpha_1+\alpha_2=1-\frac{p}{n},\qquad \alpha_1+\frac{1}{n}\sum_{i=1}^p\frac{\lambda_i\alpha_1}{\lambda_i\alpha_1+\alpha_2}=\frac{n_s}{n}. \]

The deterministic equivalent depends only on the covariance shift through \(M\), not on the means \(\mu_t,\mu_s\).
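The deterministic equivalent is easy to evaluate numerically: eliminate \(\alpha_2\) via \(\alpha_1+\alpha_2=1-p/n\), and the remaining scalar equation is increasing in \(\alpha_1\), so bisection suffices. A sketch (the function name and iteration count are our assumptions):

```python
# Evaluate the deterministic equivalent of the excess risk by solving the
# scalar fixed point for alpha_1 via bisection (illustrative sketch).
import numpy as np

def det_equiv_risk(lams, n, n_s, sigma2):
    """lams: eigenvalues of M'M (length p); requires n > p and n_s < n."""
    lams = np.asarray(lams, dtype=float)
    p = len(lams)
    c = 1.0 - p / n                           # alpha_1 + alpha_2 = 1 - p/n

    def f(a1):                                # fixed-point residual, increasing in a1
        a2 = c - a1
        return a1 + np.sum(lams * a1 / (lams * a1 + a2)) / n - n_s / n

    lo, hi = 0.0, c
    for _ in range(100):                      # bisection on [0, 1 - p/n]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    a1 = 0.5 * (lo + hi)
    return sigma2 / n * np.sum(1.0 / (a1 * lams + c - a1))
```

Sanity check: for \(M=I_p\) (no covariance shift) the formula collapses to the classical OLS excess risk \(\sigma^2 p/(n-p)\), independently of the real/synthetic split.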

Proposition (Under-parameterized, training only on synthetic data)

If \(n_t=0\) and training uses only the synthetic distribution, then, writing \(\gamma=n/p\), the mean shift reappears explicitly:

\[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\frac{\sigma^2}{n}\cdot\frac{\gamma}{\gamma-1}\left[ \operatorname{Tr}(\Sigma_t\Sigma_s^{-1}) +\|\Sigma_s^{-1/2}\mu_t\|_2^2 -\left(\frac{\mu_t^\top\Sigma_s^{-1}\mu_s}{\|\Sigma_s^{-1/2}\mu_s\|_2}\right)^2 \right]\right|=0. \]

This contrasts with the mixed-training theorem above: once real data is absent, the means matter again.
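The two mean terms in the bracket form a Cauchy–Schwarz gap in the \(\Sigma_s^{-1}\) inner product: the penalty is nonnegative and vanishes exactly when \(\mu_s\) is parallel to \(\mu_t\). A quick numerical check (the helper name is our assumption):

```python
# Mean-shift penalty from the synthetic-only risk formula:
#   ||Sigma_s^{-1/2} mu_t||^2 - (mu_t' Sigma_s^{-1} mu_s / ||Sigma_s^{-1/2} mu_s||)^2
import numpy as np

def mean_penalty(mu_t, mu_s, Sigma_s):
    Si = np.linalg.inv(Sigma_s)
    return mu_t @ Si @ mu_t - (mu_t @ Si @ mu_s) ** 2 / (mu_s @ Si @ mu_s)

p = 5
mu_t = np.arange(1.0, p + 1)
aligned = mean_penalty(mu_t, 2.0 * mu_t, np.eye(p))   # mu_s parallel to mu_t
misaligned = mean_penalty(mu_t, np.ones(p), np.eye(p))
```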


Theorem (Over-parameterized, mixed real and synthetic data). Assume \(n< p\), sample \(\beta\) independently from a sphere of constant radius, and assume that \(\Sigma_s\) and \(\Sigma_t\) are simultaneously diagonalizable: \(\Sigma_s=U\Lambda^sU^\top\), \(\Sigma_t=U\Lambda^tU^\top\).

Define the joint empirical distributions of the eigenvalue pairs \((\lambda_i^s,\lambda_i^t)\), unweighted and weighted by the signal coefficients along the shared eigenbasis \(u_1,\dots,u_p\):

\[ \hat H_p(\lambda^s,\lambda^t)=\frac{1}{p}\sum_{i=1}^p\mathbf 1_{\{(\lambda^s,\lambda^t)=(\lambda_i^s,\lambda_i^t)\}}, \qquad \hat G_p(\lambda^s,\lambda^t)=\sum_{i=1}^p\langle\beta,u_i\rangle^2\mathbf 1_{\{(\lambda^s,\lambda^t)=(\lambda_i^s,\lambda_i^t)\}}. \] Then, with high probability, \[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\mathcal V(\Sigma_s,\Sigma_t)-\mathcal B(\Sigma_s,\Sigma_t,\beta)\right|=0, \] where \[ \mathcal V(\Sigma_s,\Sigma_t)=\frac{\sigma^2}{\gamma}\int \frac{-\lambda^t(a_3\lambda^s+a_4\lambda^t)}{(a_1\lambda^s+a_2\lambda^t+1)^2}\,d\hat H_p(\lambda^s,\lambda^t), \] \[ \mathcal B(\Sigma_s,\Sigma_t,\beta)=\int \frac{b_3\lambda^s+(b_4+1)\lambda^t}{(b_1\lambda^s+b_2\lambda^t+1)^2}\,d\hat G_p(\lambda^s,\lambda^t). \]

Here \(a_i\) and \(b_i\), for \(i\in\{1,2,3,4\}\), are the unique solutions of the fixed-point equations stated in the appendix of the paper. Unlike the under-parameterized regime, both the variance and the bias contribute to the limit.

Even here, once real and synthetic data are used together, the deterministic equivalent still drops the mean shift.

Optimized synthetic data

The deterministic equivalents turn the selection question into an optimization problem over the synthetic covariance. We prove covariance matching directly from these limits.

Theorem (Under-parameterized optimality)

Define \(\mathcal M=\{M\in\mathbb R^{p\times p}:\mathrm{rank}(M)=p,\ \mathrm{Tr}(M^\top M)=p\}\), and let \(M_{\mathrm{opt}}\) minimize the limit risk from the under-parameterized theorem:

\[ M_{\mathrm{opt}}=\arg\inf_{M\in\mathcal M}\mathcal R_u(M). \] \[ \lambda_i(M_{\mathrm{opt}}^\top M_{\mathrm{opt}})=1,\qquad \forall i\in\{1,\ldots,p\}. \]

So, under trace normalization, the optimum has a perfectly balanced spectrum. Equivalently, given \(\Sigma_t\), choosing \(\Sigma_s\propto\Sigma_t\) is optimal: matching covariance is the right objective.
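A numerical sanity check of this statement (an illustrative sketch: the bisection solver below evaluates the limit risk from the under-parameterized theorem, and the skewed spectrum is an arbitrary trace-normalized example): the flat spectrum attains the smallest limit risk.

```python
# Compare the limit risk for a flat vs. an unbalanced spectrum of M'M,
# both normalized to Tr(M'M) = p (illustrative check of the optimality theorem).
import numpy as np

def limit_risk(lams, n, n_s, sigma2=1.0):
    lams = np.asarray(lams, dtype=float)
    p = len(lams)
    c = 1.0 - p / n                           # alpha_1 + alpha_2 = 1 - p/n
    f = lambda a1: a1 + np.sum(lams * a1 / (lams * a1 + c - a1)) / n - n_s / n
    lo, hi = 0.0, c
    for _ in range(200):                      # bisection (f is increasing)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    a1 = 0.5 * (lo + hi)
    return sigma2 / n * np.sum(1.0 / (a1 * lams + c - a1))

p, n, n_s = 40, 200, 80
flat = np.ones(p)                             # balanced spectrum, Tr = p
skew = np.linspace(0.2, 1.8, p)
skew *= p / skew.sum()                        # same trace, unbalanced
```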


Theorem (Over-parameterized near-optimality). Let \(\mathcal S=\{\Sigma\in\mathbb R^{p\times p}_{\succ 0}:\mathrm{Tr}(\Sigma)=p\}\), and define \(\mathcal R_o(\Sigma_s,\Sigma_t,\beta)=\mathcal V(\Sigma_s,\Sigma_t)+\mathcal B(\Sigma_s,\Sigma_t,\beta)\). Then, for isotropic training data \(\Sigma_t=I_p\), with high probability over \(\beta\),

\[ \mathcal R_o(I_p,I_p,\beta)\leq \mathcal R_o(\Sigma_s,I_p,\beta)+o(1),\qquad \forall \Sigma_s\in\mathcal S. \]

So in the over-parameterized regime, covariance matching is again optimal up to a vanishing term, at least when the real training covariance is isotropic.

Interpretation

The theory says to match the spread of the real distribution, not just its center. This is exactly the principle used later in the practical covariance-matching selector built on learned features.

Algorithm

What algorithm to use? Covariance Matching

Setup

  • Real training set \((X_t, y_t)\) and synthetic augmentation set \((X_s, y_s)\).
  • Train on the union of real and synthetic samples; evaluate on a test distribution that matches the real training distribution.
  • Selection is done class by class from a generated pool.

Selection objective

In practice, we compute features for both real and synthetic samples, then select a subset of synthetic samples whose sample covariance best matches the real sample covariance.

# Pseudocode (per class)
Input: real features R (nt x p), synthetic pool features P (N x p), target size ns
Fit PCA on R and keep d dimensions (e.g., d = 32)
Project both sets: R_d, P_d
Compute the target covariance Ct = cov(R_d)

S = empty set
while |S| < ns:
  pick the x in P_d \ S that minimizes || cov(S ∪ {x}) - Ct ||_F
  move x from the pool into S

Return the selected indices S
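A runnable version of the pseudocode (a sketch under assumptions: the function name is ours, PCA is done with a plain SVD to avoid dependencies, and `np.cov(..., ddof=0)` keeps the singleton step well defined):

```python
# Greedy per-class covariance matching in feature space.
import numpy as np

def select_covariance_matching(R, P, ns, d=32):
    """R: (nt, p) real features; P: (N, p) synthetic pool. Returns pool indices."""
    d = min(d, min(R.shape))
    mu = R.mean(axis=0)
    _, _, Vt = np.linalg.svd(R - mu, full_matrices=False)
    W = Vt[:d].T                              # top-d PCA directions of R
    R_d, P_d = (R - mu) @ W, (P - mu) @ W
    Ct = np.cov(R_d, rowvar=False, ddof=0)    # target covariance

    selected, remaining = [], list(range(len(P_d)))
    for _ in range(ns):
        def gap(i):                           # Frobenius gap if i were added
            C = np.cov(P_d[selected + [i]], rowvar=False, ddof=0)
            return np.linalg.norm(C - Ct)
        best = min(remaining, key=gap)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The greedy step recomputes the candidate covariance from scratch for clarity; a rank-one update of the running covariance would make each round cheaper.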

Each accepted sample rotates and reshapes the synthetic covariance toward the real one.


Experiments

Does Covariance Matching work in practice?

Dataset

CIFAR-10

We use 200 real samples per class as reference and select 800 synthetic samples per class from a 10K-image pool.


Truncated Models: 6K images from a 0.2-truncated StyleGAN2-Ada model with three random truncation centers, plus 4K images from a 0.6-truncated model with two random centers.

T2I Models: 4K SANA-1.5 images, 4K PixArt-alpha images, and 2K Stable Diffusion 1.4 images.

ImageNet-100

We use 200 real samples per class as reference and select 800 synthetic samples per class from a 10K-image pool.


Truncated Models: 6K images from a 0.2-truncated StyleGAN-XL model with three random truncation centers, plus 4K images from a 0.6-truncated model with two random centers.

T2I Models: 4K SANA-1.5 images, 4K PixArt-alpha images, and 2K Stable Diffusion 1.4 images.

RxRx1

MorphGen produces a pool of 500 synthetic images per class for 2 classes; selection augments 30 real images per class with 60 synthetic images per class, and evaluation uses a linear classifier on frozen ImageNet-pretrained ResNet features.


TweetIrony

We sample 100 real tweets and augment them with 300 GPT-generated tweets, selecting in all-mpnet-base-v2 embedding space.


BibTeX

@inproceedings{rezaei2026highdimensional,
  title={High-dimensional Analysis of Synthetic Data Selection},
  author={Parham Rezaei and Filip Kova{\v{c}}evi{\'c} and Francesco Locatello and Marco Mondelli},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=Y54P2BBPPh}
}