If they calibrated directly to market quotes, they would run into the problem of overfitting to a limited dataset. But since they calibrate to model quotes, they can generate as many training and test samples as they like! I like that.
Yes, this approach feels good. In the present case they took 85% of the data for training and 15% for testing to evaluate network performance; the 85% was further split into 65% for training proper and 20% for validation. At my request they performed 5-fold cross-validation. (I also requested a confusion matrix to see how the classification works, but perhaps time ran out.) Cross-validation errors: MSE ~ 9 x 10^-8, MAE ~ 0.0002 (Thesis #1).
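To make the 5-fold procedure concrete, here is a minimal pure-Python sketch of k-fold cross-validation. The toy dataset and the one-parameter linear "model" are stand-ins for the thesis's synthetic quotes and neural network; only the fold-splitting and per-fold MSE/MAE bookkeeping are the point.

```python
# Minimal 5-fold cross-validation sketch (pure Python).
# The data and the toy linear model are illustrative stand-ins,
# NOT the thesis's actual network or dataset.
import random

random.seed(0)

# Toy synthetic data: y = 2x + noise, standing in for (model params -> quotes).
xs = [random.uniform(0.0, 1.0) for _ in range(100)]
ys = [2.0 * x + random.gauss(0.0, 0.01) for x in xs]

def fit_slope(x, y):
    """Least-squares slope through the origin: argmin_a sum (y - a*x)^2."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def kfold_errors(xs, ys, k=5):
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint validation folds
    mses, maes = [], []
    for i in range(k):
        val = set(folds[i])
        xtr = [xs[j] for j in idx if j not in val]  # train on the other k-1 folds
        ytr = [ys[j] for j in idx if j not in val]
        a = fit_slope(xtr, ytr)
        errs = [ys[j] - a * xs[j] for j in folds[i]]
        mses.append(sum(e * e for e in errs) / len(errs))
        maes.append(sum(abs(e) for e in errs) / len(errs))
    return sum(mses) / k, sum(maes) / k            # average over folds

mse, mae = kfold_errors(xs, ys)
print(mse, mae)
```

The reported thesis numbers (MSE ~ 9 x 10^-8) would come out of exactly this kind of per-fold averaging, just with the trained network in place of the toy model.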
Synthetic SABR data: 3'000'00 samples, with each model parameter drawn uniformly at random.
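A sketch of how such a synthetic SABR dataset can be generated: draw (alpha, beta, rho, nu) uniformly and label each draw with the Hagan et al. ATM lognormal implied-vol approximation. The parameter ranges, F and T are my own illustrative choices, not the bounds used in the thesis.

```python
# Synthetic SABR training data: uniform parameter draws + Hagan ATM vol label.
# Parameter ranges, F and T below are hypothetical, not the thesis's values.
import random

random.seed(1)

def sabr_atm_vol(alpha, beta, rho, nu, F, T):
    """Hagan et al. ATM lognormal implied-vol approximation."""
    Fb = F ** (1.0 - beta)
    term = ((1.0 - beta) ** 2 / 24.0 * alpha ** 2 / Fb ** 2
            + rho * beta * nu * alpha / (4.0 * Fb)
            + (2.0 - 3.0 * rho ** 2) / 24.0 * nu ** 2)
    return alpha / Fb * (1.0 + term * T)

def sample_sabr():
    """One uniform draw per model parameter (hypothetical ranges)."""
    return dict(alpha=random.uniform(0.05, 0.5),
                beta=random.uniform(0.1, 0.9),
                rho=random.uniform(-0.9, 0.9),
                nu=random.uniform(0.1, 1.0))

# (params -> implied vol) pairs; the thesis uses millions of such samples.
data = [(p, sabr_atm_vol(F=1.0, T=1.0, **p))
        for p in (sample_sabr() for _ in range(1000))]
vols = [v for _, v in data]
print(len(data), min(vols), max(vols))
```

For the full smile one would label each sample with vols across a strike grid instead of the ATM point only, but the sampling logic is the same.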
For the Heston model, the idea is to generate the data in two ways: 1) via the (semi-)analytical solution, and 2) via Soviet-school operator splitting (Yanenko); in the "West", the Craig-Sneyd ADI scheme is the usual choice. It will be interesting to see what comes out.
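To illustrate the Yanenko (locally one-dimensional) splitting idea, here is a minimal sketch on the 2D heat equation u_t = u_xx + u_yy rather than the full Heston PDE: each time step is one implicit sweep in x followed by one implicit sweep in y, each a cheap tridiagonal solve. Grid size, time step, and the test initial condition are my own choices for the demo.

```python
# Yanenko (locally one-dimensional) splitting demo on u_t = u_xx + u_yy,
# unit square, zero Dirichlet BC. Each step: implicit x-sweep, then y-sweep.
# Grid/step sizes here are illustrative, not from the thesis.
import math

def thomas(a, b, c, d):
    """Thomas algorithm for a tridiagonal system (a=sub, b=diag, c=super)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def yanenko_step(u, r):
    """One LOD step: solve (I - r*Dxx)u* = u row-wise,
    then (I - r*Dyy)u** = u* column-wise (r = dt/h^2)."""
    n = len(u)
    a, b, c = [-r] * n, [1.0 + 2.0 * r] * n, [-r] * n
    a[0], c[-1] = 0.0, 0.0
    u = [thomas(a, b, c, row) for row in u]                      # x-sweep
    cols = [thomas(a, b, c, [u[i][j] for i in range(n)])         # y-sweep
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

n, steps, dt = 19, 50, 1e-3
h = 1.0 / (n + 1)
u = [[math.sin(math.pi * (i + 1) * h) * math.sin(math.pi * (j + 1) * h)
      for j in range(n)] for i in range(n)]
for _ in range(steps):
    u = yanenko_step(u, dt / h ** 2)
peak = u[n // 2][n // 2]                       # value at the center (x=y=0.5)
exact = math.exp(-2.0 * math.pi ** 2 * steps * dt)  # exact decay of this mode
print(peak, exact)
```

For Heston the same one-dimension-at-a-time structure applies (sweeps in spot and in variance), while Craig-Sneyd ADI adds corrector stages to handle the mixed derivative term; the splitting above ignores cross terms entirely.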
How often does the network need to be retrained: every 6 months, ..., every few seconds??