High-dimensional Analysis of Synthetic Data Selection

ICLR 2026 Oral

Institute of Science and Technology Austria (ISTA)
[Teaser figure: synthetic data selection via covariance matching]

Abstract

Despite the progress in the development of generative models, their usefulness in creating synthetic data that improves the prediction performance of classifiers has been called into question. Beyond heuristic principles such as "synthetic data should be close to the real data distribution", it remains unclear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets, and generative models used for augmentation.

What matters

Covariance shift affects excess risk; mean shift drops out in mixed training.

What is optimal

Under natural normalization, matching synthetic covariance to real covariance is optimal.

Why it matters

The same principle transfers well to modern feature extractors and deep models.

Theory

What do we prove?

Setup and assumptions

We study linear regression with real training data \((X_t, y_t)\) and synthetic augmentation data \((X_s, y_s)\), both generated by the same underlying parameter \(\beta\).

\(y_{(i)} = X_{(i)}\beta + \varepsilon_{(i)}\), \(\quad X_{(i)} = Z^{(i)}\Sigma_{(i)}^{1/2} + 1_{n_{(i)}}\mu_{(i)}^\top,\quad i\in\{t,s\}\).

Distributions

Real and synthetic data may have different means \(\mu_t,\mu_s\) and covariances \(\Sigma_t,\Sigma_s\), but they share the same conditional label model.

Proportional regime

Dimension and sample sizes grow together: the ratios \(n_t/p\), \(n_s/p\), and \(n/p\), with \(n=n_t+n_s\), remain of constant order.

Regularity

Noise is centered with variance \(\sigma^2\); the entries of \(Z^{(i)}\) have bounded moments; and the spectra of \(\Sigma_t,\Sigma_s\) are bounded away from \(0\) and \(\infty\).

Estimator and risk

We train min-norm least squares on the union of real and synthetic samples, and evaluate excess risk on the real target distribution.
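As a concrete illustration, here is a minimal simulation of this setup (an assumed sketch, not the paper's code: the sizes, seed, and the `sample_linear` helper are illustrative). Even with a large mean shift in the synthetic data, the min-norm estimator trained on the union recovers \(\beta\) well, in line with the theorem below.

```python
# Minimal simulation of the setup: real and synthetic samples share the same
# beta but differ in mean/covariance; we fit min-norm least squares on the union.
# (Illustrative sketch; sizes and helper names are assumptions.)
import numpy as np

def sample_linear(n, p, beta, mu, cov_sqrt, sigma, rng):
    X = rng.standard_normal((n, p)) @ cov_sqrt + mu
    return X, X @ beta + sigma * rng.standard_normal(n)

rng = np.random.default_rng(0)
p, n_t, n_s, sigma = 20, 60, 60, 0.5
beta = rng.standard_normal(p) / np.sqrt(p)

mu_t, Sig_t = np.zeros(p), np.eye(p)        # target distribution
mu_s, Sig_s = 3.0 * np.ones(p), np.eye(p)   # synthetic: large mean shift
Xt, yt = sample_linear(n_t, p, beta, mu_t, np.linalg.cholesky(Sig_t), sigma, rng)
Xs, ys = sample_linear(n_s, p, beta, mu_s, np.linalg.cholesky(Sig_s), sigma, rng)

X, y = np.vstack([Xt, Xs]), np.concatenate([yt, ys])
beta_hat = np.linalg.pinv(X) @ y            # min-norm least squares (n > p)
excess_risk = (beta_hat - beta) @ Sig_t @ (beta_hat - beta)
```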

Deterministic equivalents

Theorem (Under-parameterized, mixed real and synthetic data). Assume \(n>p\), so the min-norm estimator has zero bias. Let \(M=\Sigma_s^{1/2}\Sigma_t^{-1/2}\), and let \(\lambda_1,\dots,\lambda_p\) be the eigenvalues of \(M^\top M\). Then, with high probability,

\[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\frac{\sigma^2}{n}\operatorname{Tr}\!\left[\left(\alpha_1 M^\top M+\alpha_2 I_p\right)^{-1}\right]\right|=0, \] \[ \alpha_1+\alpha_2=1-\frac{p}{n},\qquad \alpha_1+\frac{1}{n}\sum_{i=1}^p\frac{\lambda_i\alpha_1}{\lambda_i\alpha_1+\alpha_2}=\frac{n_s}{n}. \]

The deterministic equivalent depends only on the covariance shift through \(M\), not on the means \(\mu_t,\mu_s\).
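The deterministic equivalent is easy to evaluate numerically: eliminate \(\alpha_2\) via \(\alpha_1+\alpha_2=1-p/n\), and the remaining scalar equation is increasing in \(\alpha_1\), so bisection suffices. A sketch (the function name and iteration count are our assumptions):

```python
# Evaluate the deterministic equivalent of the excess risk by solving the
# scalar fixed point for alpha_1 via bisection (illustrative sketch).
import numpy as np

def det_equiv_risk(lams, n, n_s, sigma2):
    """lams: eigenvalues of M'M (length p); requires n > p and n_s < n."""
    lams = np.asarray(lams, dtype=float)
    p = len(lams)
    c = 1.0 - p / n                           # alpha_1 + alpha_2 = 1 - p/n

    def f(a1):                                # fixed-point residual, increasing in a1
        a2 = c - a1
        return a1 + np.sum(lams * a1 / (lams * a1 + a2)) / n - n_s / n

    lo, hi = 0.0, c
    for _ in range(100):                      # bisection on [0, 1 - p/n]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    a1 = 0.5 * (lo + hi)
    return sigma2 / n * np.sum(1.0 / (a1 * lams + c - a1))
```

Sanity check: for \(M=I_p\) (no covariance shift) the formula collapses to the classical OLS excess risk \(\sigma^2 p/(n-p)\), independently of the real/synthetic split.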

Proposition (Under-parameterized, training only on synthetic data)

If \(n_t=0\) and training uses only the synthetic distribution, then, writing \(\gamma=n/p\), the mean shift reappears explicitly:

\[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\frac{\sigma^2}{n}\cdot\frac{\gamma}{\gamma-1}\left[ \operatorname{Tr}(\Sigma_t\Sigma_s^{-1}) +\|\Sigma_s^{-1/2}\mu_t\|_2^2 -\left(\frac{\mu_t^\top\Sigma_s^{-1}\mu_s}{\|\Sigma_s^{-1/2}\mu_s\|_2}\right)^2 \right]\right|=0. \]

This contrasts with the mixed-training theorem above: once real data is absent, the means matter again.
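The two mean terms in the bracket form a Cauchy–Schwarz gap in the \(\Sigma_s^{-1}\) inner product: the penalty is nonnegative and vanishes exactly when \(\mu_s\) is parallel to \(\mu_t\). A quick numerical check (the helper name is our assumption):

```python
# Mean-shift penalty from the synthetic-only risk formula:
#   ||Sigma_s^{-1/2} mu_t||^2 - (mu_t' Sigma_s^{-1} mu_s / ||Sigma_s^{-1/2} mu_s||)^2
import numpy as np

def mean_penalty(mu_t, mu_s, Sigma_s):
    Si = np.linalg.inv(Sigma_s)
    return mu_t @ Si @ mu_t - (mu_t @ Si @ mu_s) ** 2 / (mu_s @ Si @ mu_s)

p = 5
mu_t = np.arange(1.0, p + 1)
aligned = mean_penalty(mu_t, 2.0 * mu_t, np.eye(p))   # mu_s parallel to mu_t
misaligned = mean_penalty(mu_t, np.ones(p), np.eye(p))
```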


Theorem (Over-parameterized, mixed real and synthetic data). Assume \(n< p\), sample \(\beta\) independently from a sphere of constant radius, and assume that \(\Sigma_s\) and \(\Sigma_t\) are simultaneously diagonalizable: \(\Sigma_s=U\Lambda^sU^\top\), \(\Sigma_t=U\Lambda^tU^\top\).

Define the joint empirical distributions of the eigenvalue pairs \((\lambda_i^s,\lambda_i^t)\), unweighted and weighted by the signal coefficients along the shared eigenbasis \(u_1,\dots,u_p\):

\[ \hat H_p(\lambda^s,\lambda^t)=\frac{1}{p}\sum_{i=1}^p\mathbf 1_{\{(\lambda^s,\lambda^t)=(\lambda_i^s,\lambda_i^t)\}}, \qquad \hat G_p(\lambda^s,\lambda^t)=\sum_{i=1}^p\langle\beta,u_i\rangle^2\mathbf 1_{\{(\lambda^s,\lambda^t)=(\lambda_i^s,\lambda_i^t)\}}. \] Then, with high probability, \[ \lim_{n\to\infty}\left|R_X(\hat{\beta};\beta)-\mathcal V(\Sigma_s,\Sigma_t)-\mathcal B(\Sigma_s,\Sigma_t,\beta)\right|=0, \] where \[ \mathcal V(\Sigma_s,\Sigma_t)=\frac{\sigma^2}{\gamma}\int \frac{-\lambda^t(a_3\lambda^s+a_4\lambda^t)}{(a_1\lambda^s+a_2\lambda^t+1)^2}\,d\hat H_p(\lambda^s,\lambda^t), \] \[ \mathcal B(\Sigma_s,\Sigma_t,\beta)=\int \frac{b_3\lambda^s+(b_4+1)\lambda^t}{(b_1\lambda^s+b_2\lambda^t+1)^2}\,d\hat G_p(\lambda^s,\lambda^t). \]

Here \(a_i\) and \(b_i\), for \(i\in\{1,2,3,4\}\), are the unique solutions of the fixed-point equations stated in the appendix of the paper. Unlike the under-parameterized regime, both the variance and the bias contribute to the limit.

Even here, once real and synthetic data are used together, the deterministic equivalent still drops the mean shift.

Optimized synthetic data

The deterministic equivalents turn the selection question into an optimization problem over the synthetic covariance. We prove covariance matching directly from these limits.

Theorem (Under-parameterized optimality)

Define \(\mathcal M=\{M\in\mathbb R^{p\times p}:\mathrm{rank}(M)=p,\ \mathrm{Tr}(M^\top M)=p\}\), and let \(M_{\mathrm{opt}}\) minimize the limit risk from the under-parameterized theorem:

\[ M_{\mathrm{opt}}=\arg\inf_{M\in\mathcal M}\mathcal R_u(M). \] \[ \lambda_i(M_{\mathrm{opt}}^\top M_{\mathrm{opt}})=1,\qquad \forall i\in\{1,\ldots,p\}. \]

So, under trace normalization, the optimum has a perfectly balanced spectrum. Equivalently, given \(\Sigma_t\), choosing \(\Sigma_s\propto\Sigma_t\) is optimal: matching covariance is the right objective.
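A numerical sanity check of this statement (an illustrative sketch: the bisection solver below evaluates the limit risk from the under-parameterized theorem, and the skewed spectrum is an arbitrary trace-normalized example): the flat spectrum attains the smallest limit risk.

```python
# Compare the limit risk for a flat vs. an unbalanced spectrum of M'M,
# both normalized to Tr(M'M) = p (illustrative check of the optimality theorem).
import numpy as np

def limit_risk(lams, n, n_s, sigma2=1.0):
    lams = np.asarray(lams, dtype=float)
    p = len(lams)
    c = 1.0 - p / n                           # alpha_1 + alpha_2 = 1 - p/n
    f = lambda a1: a1 + np.sum(lams * a1 / (lams * a1 + c - a1)) / n - n_s / n
    lo, hi = 0.0, c
    for _ in range(200):                      # bisection (f is increasing)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    a1 = 0.5 * (lo + hi)
    return sigma2 / n * np.sum(1.0 / (a1 * lams + c - a1))

p, n, n_s = 40, 200, 80
flat = np.ones(p)                             # balanced spectrum, Tr = p
skew = np.linspace(0.2, 1.8, p)
skew *= p / skew.sum()                        # same trace, unbalanced
```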


Theorem (Over-parameterized near-optimality). Let \(\mathcal S=\{\Sigma\in\mathbb R^{p\times p}_{\succ 0}:\mathrm{Tr}(\Sigma)=p\}\), and define \(\mathcal R_o(\Sigma_s,\Sigma_t,\beta)=\mathcal V(\Sigma_s,\Sigma_t)+\mathcal B(\Sigma_s,\Sigma_t,\beta)\). Then, for isotropic training data \(\Sigma_t=I_p\), with high probability over \(\beta\),

\[ \mathcal R_o(I_p,I_p,\beta)\leq \mathcal R_o(\Sigma_s,I_p,\beta)+o(1),\qquad \forall \Sigma_s\in\mathcal S. \]

So in the over-parameterized regime, covariance matching is again optimal up to a vanishing term, at least when the real training covariance is isotropic.

Interpretation

The theory says to match the spread of the real distribution, not just its center. This is exactly the principle used later in the practical covariance-matching selector built on learned features.

Algorithm

What algorithm to use? Covariance Matching

Setup

  • Real training set \((X_t, y_t)\) and synthetic augmentation set \((X_s, y_s)\).
  • Train on the union of real and synthetic samples; evaluate on a test distribution that matches the real training distribution.
  • Selection is done class by class from a generated pool.

Selection objective

In practice, we compute features for both real and synthetic samples, then select a subset of synthetic samples whose sample covariance best matches the real sample covariance.

# Pseudocode (per class)
Input: real features R (nt x p), synthetic pool features P (N x p), target size ns
Fit PCA on R and keep d dimensions (e.g., d = 32)
Project both sets: R_d, P_d
Compute the target covariance Ct = cov(R_d)

S = empty set
while |S| < ns:
  pick the x in P_d \ S that minimizes || cov(S ∪ {x}) - Ct ||_F
  move x from the pool into S

Return the selected indices S
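A runnable version of the pseudocode (a sketch under assumptions: the function name is ours, PCA is done with a plain SVD to avoid dependencies, and `np.cov(..., ddof=0)` keeps the singleton step well defined):

```python
# Greedy per-class covariance matching in feature space.
import numpy as np

def select_covariance_matching(R, P, ns, d=32):
    """R: (nt, p) real features; P: (N, p) synthetic pool. Returns pool indices."""
    d = min(d, min(R.shape))
    mu = R.mean(axis=0)
    _, _, Vt = np.linalg.svd(R - mu, full_matrices=False)
    W = Vt[:d].T                              # top-d PCA directions of R
    R_d, P_d = (R - mu) @ W, (P - mu) @ W
    Ct = np.cov(R_d, rowvar=False, ddof=0)    # target covariance

    selected, remaining = [], list(range(len(P_d)))
    for _ in range(ns):
        def gap(i):                           # Frobenius gap if i were added
            C = np.cov(P_d[selected + [i]], rowvar=False, ddof=0)
            return np.linalg.norm(C - Ct)
        best = min(remaining, key=gap)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The greedy step recomputes the candidate covariance from scratch for clarity; a rank-one update of the running covariance would make each round cheaper.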

Each accepted sample rotates and reshapes the synthetic covariance toward the real one.


Experiments

Does Covariance Matching work in practice?

Dataset

CIFAR-10

We use 200 real samples per class as reference and select 800 synthetic samples per class from a 10K-image pool.


Truncated Models: 6K images from a 0.2-truncated StyleGAN2-Ada model with three random truncation centers, plus 4K images from a 0.6-truncated model with two random centers.

T2I Models: 4K SANA-1.5 images, 4K PixArt-alpha images, and 2K Stable Diffusion 1.4 images.

ImageNet-100

We use 200 real samples per class as reference and select 800 synthetic samples per class from a 10K-image pool.


Truncated Models: 6K images from a 0.2-truncated StyleGAN-XL model with three random truncation centers, plus 4K images from a 0.6-truncated model with two random centers.

T2I Models: 4K SANA-1.5 images, 4K PixArt-alpha images, and 2K Stable Diffusion 1.4 images.

RxRx1

MorphGen produces a pool of 500 synthetic images per class for 2 classes; selection augments 30 real images per class with 60 synthetic images per class, and evaluation uses a linear classifier on frozen ImageNet-pretrained ResNet features.


TweetIrony

We sample 100 real tweets and augment them with 300 GPT-generated tweets, selecting in all-mpnet-base-v2 embedding space.


BibTeX

@inproceedings{rezaei2026highdimensional,
  title={High-dimensional Analysis of Synthetic Data Selection},
  author={Parham Rezaei and Filip Kova{\v{c}}evi{\'c} and Francesco Locatello and Marco Mondelli},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=Y54P2BBPPh}
}