Setup and assumptions
We study linear regression with real training data \((X_t, y_t)\) and synthetic augmentation data \((X_s, y_s)\), both generated by the same underlying parameter \(\beta\).
\(y_i = X_i\beta + \varepsilon_i\), \(\quad X_i = Z^{(i)}\Sigma_i^{1/2} + 1_{n_i}\mu_i^\top,\quad i\in\{t,s\}\).
Distributions
Real and synthetic data may have different means \(\mu_t,\mu_s\) and covariances \(\Sigma_t,\Sigma_s\), but they share the same conditional label model.
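As a rough illustration of this data model, the sketch below draws one real and one synthetic sample that share \(\beta\) but differ in mean and covariance. The helper name `sample_domain`, the Gaussian choice for the entries of \(Z^{(i)}\) and the noise, the diagonal \(\Sigma_s\), and all sizes are assumptions made for the example, not choices taken from the text.

```python
import numpy as np

def sample_domain(n, beta, mu, sigma_sqrt, noise_std, rng):
    """Draw (X, y) with X = Z Sigma^{1/2} + 1 mu^T and y = X beta + eps.

    Hypothetical helper: Z and eps are Gaussian here only for convenience.
    """
    p = beta.shape[0]
    Z = rng.standard_normal((n, p))           # i.i.d. standardized entries
    X = Z @ sigma_sqrt + mu                   # mu is broadcast over the n rows
    eps = noise_std * rng.standard_normal(n)  # centered noise, variance noise_std**2
    y = X @ beta + eps                        # same beta for every domain
    return X, y

rng = np.random.default_rng(0)
p, n_t, n_s = 20, 30, 60                      # illustrative sizes
beta = rng.standard_normal(p) / np.sqrt(p)

# Real domain: zero mean, identity covariance (illustrative choice).
mu_t, sigma_t_sqrt = np.zeros(p), np.eye(p)
# Synthetic domain: shifted mean, diagonal covariance (illustrative choice).
mu_s = 0.5 * np.ones(p)
sigma_s_sqrt = np.diag(np.sqrt(np.linspace(0.5, 2.0, p)))

X_t, y_t = sample_domain(n_t, beta, mu_t, sigma_t_sqrt, 0.1, rng)
X_s, y_s = sample_domain(n_s, beta, mu_s, sigma_s_sqrt, 0.1, rng)
```

Only the feature distribution changes between the two calls; the label rule \(y = X\beta + \varepsilon\) is identical, which is what sharing the conditional label model means here.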
Proportional regime
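Dimension and sample sizes grow together: the ratios \(n_t/p\), \(n_s/p\), and \(n/p\), where \(n = n_t + n_s\) is the total sample size, remain bounded away from \(0\) and \(\infty\) as \(p\to\infty\).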
Regularity
Noise is centered with variance \(\sigma^2\); the entries of \(Z^{(i)}\) have bounded moments; and the spectra of \(\Sigma_t,\Sigma_s\) are bounded away from \(0\) and \(\infty\).
Estimator and risk
We fit the minimum-norm least-squares estimator to the union of the real and synthetic samples, and evaluate its excess risk under the real (target) distribution.
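A minimal sketch of this estimator and risk follows. It assumes the excess risk is the expected squared prediction error of \(\hat\beta\) relative to \(\beta\) for a fresh real feature vector, which for features with mean \(\mu_t\) and covariance \(\Sigma_t\) equals \((\hat\beta-\beta)^\top(\Sigma_t+\mu_t\mu_t^\top)(\hat\beta-\beta)\). The function names, Gaussian draws, and problem sizes are hypothetical choices for the example.

```python
import numpy as np

def min_norm_ls(X, y):
    """Minimum-l2-norm least-squares solution, via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X) @ y

def excess_risk(beta_hat, beta, mu_t, Sigma_t):
    """(beta_hat - beta)^T (Sigma_t + mu_t mu_t^T) (beta_hat - beta).

    Assumed definition: squared prediction error under the real feature law.
    """
    d = beta_hat - beta
    second_moment = Sigma_t + np.outer(mu_t, mu_t)
    return float(d @ second_moment @ d)

rng = np.random.default_rng(0)
p, n_t, n_s, noise_std = 40, 15, 15, 0.1      # n_t + n_s < p: overparameterized
beta = rng.standard_normal(p) / np.sqrt(p)

# Real domain: mu_t = 0, Sigma_t = I (illustrative choice).
mu_t, Sigma_t = np.zeros(p), np.eye(p)
X_t = rng.standard_normal((n_t, p))
y_t = X_t @ beta + noise_std * rng.standard_normal(n_t)

# Synthetic domain: shifted mean and a different diagonal covariance.
mu_s = 0.3 * np.ones(p)
Sigma_s_sqrt = np.diag(np.sqrt(np.linspace(0.5, 2.0, p)))
X_s = rng.standard_normal((n_s, p)) @ Sigma_s_sqrt + mu_s
y_s = X_s @ beta + noise_std * rng.standard_normal(n_s)

# Train on the union of real and synthetic samples.
X = np.vstack([X_t, X_s])
y = np.concatenate([y_t, y_s])
beta_hat = min_norm_ls(X, y)

# Evaluate on the real target distribution only.
risk = excess_risk(beta_hat, beta, mu_t, Sigma_t)
```

With \(n < p\) the pooled design has more columns than rows, so many coefficient vectors fit the training labels exactly; the pseudoinverse selects the one with the smallest Euclidean norm.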