A few thoughts on sharpness-flatness:
1) In a deeply layered network, a subtle and relatively soft nonlinearity in each layer can compound up the layers into an extremely sharp function -- approaching a step function -- between the input and an output many layers away. Yet the exact location of the step might depend only mildly on any one of the parameters in those layers (although the parameter values would be strongly correlated across the layers). Thus sharpness can be emergent across an extended region of the net rather than always localized to a few nodes. The toy sketch below makes this concrete.
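Here's a minimal sketch of that claim (my own construction, not from any cited paper -- the per-layer gain of 1.5 and the depth of 30 are arbitrary): composing a mildly sloped tanh many times yields a near-step input-to-output map, while perturbing any single layer's parameter barely moves the output.

```python
import numpy as np

def deep_map(x, gains):
    """Compose y = tanh(g * y) over the layers; each g is a mild per-layer slope."""
    y = x
    for g in gains:
        y = np.tanh(g * y)
    return y

gains = [1.5] * 30                 # 30 layers, each only mildly nonlinear
xs = np.linspace(-0.1, 0.1, 5)
print(deep_map(xs, gains))         # off-zero inputs snap to ~ -0.86 / +0.86:
                                   # nearly a step located at x = 0
print(1.5 ** 30)                   # composed slope at the origin, ~1.9e5:
                                   # the sharpness is emergent across all layers

# Perturb a single layer's parameter: the output barely moves, because the
# stable fixed point of the downstream layers is unchanged.
perturbed = list(gains)
perturbed[10] = 1.6
print(deep_map(0.03, gains), deep_map(0.03, perturbed))
```

The composed slope scales like gain**depth, so extreme sharpness arises from many layers that are individually gentle -- no single node owns it.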
You're talking about something other than what the paper I cited discusses.
2) A sufficiently large NN has so many DOF that it can express many different solution architectures with equivalent output performance despite very different internal parameter structures and flatness/sharpness properties. Perhaps one reason for the robustness of performance is that these systems have so many DOF that if one part of the net gets locked into a suboptimal local minimum, some other part of the net finds a better minimum or a compensating function. The sketch after this comment shows how different parameter settings can implement the identical function.
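A tiny example of radically different parameters implementing exactly the same function, with different curvature profiles. This is just the positive rescaling symmetry of a one-hidden-unit ReLU net (my toy, not from the cited paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, w1, w2):
    """One hidden ReLU unit: f(x) = w2 * relu(w1 * x)."""
    return w2 * relu(w1 * x)

x = np.linspace(-2.0, 2.0, 9)
a = 100.0                              # scale one layer up, the other down
f1 = net(x, w1=1.0, w2=1.0)
f2 = net(x, w1=1.0 * a, w2=1.0 / a)
print(np.allclose(f1, f2))             # True: identical input-output behaviour

# Yet df/dw1 = w2 * x on the active side, so squared-error loss curvature
# along w1 at the second point is ~a**2 flatter (and along w2 ~a**2 sharper):
# two functionally equivalent minima with very different sharpness profiles.
```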
That would explain successful training, not successful generalisation. A common mistake...
4) Some aspects of flatness vs. sharpness are intrinsic to the problem. If you are attempting to predict the phase of water as a function of temperature under equilibrium conditions (at fixed atmospheric pressure), the transitions between solid, liquid, and gas are perfectly sharp -- 100.1°C water will ALWAYS be a gas. But if you are attempting to predict the phase of water as a function of temperature under dynamic, non-equilibrium conditions, the transitions between solid, liquid, and gas are mushy -- 100.1°C water might be liquid in a large fraction of the training set, and in real life. See the sketch below.
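Point (4) in code, under some loud assumptions (fixed 1 atm pressure, boiling at exactly 100°C, and a made-up ~1°C smearing width for the non-equilibrium case):

```python
import numpy as np

rng = np.random.default_rng(0)
temps = np.array([99.0, 99.9, 100.1, 101.0])

# Equilibrium: the target itself is a hard step, so a sharp fit is correct.
equilibrium_is_gas = (temps > 100.0).astype(int)
print(equilibrium_is_gas)            # [0 0 1 1]: 100.1 C is always gas

# Non-equilibrium: superheating and lag smear the boundary, so labels are
# noisy near 100 C and the best predictor of P(gas) is itself soft.
p_gas = 1.0 / (1.0 + np.exp(-(temps - 100.0) / 1.0))
print(p_gas.round(2))                # [0.27 0.48 0.52 0.73]
sampled_labels = (rng.random(temps.shape) < p_gas).astype(int)
print(sampled_labels)                # one noisy draw of the training labels
```

In the first regime the intrinsically correct decision function is a step; in the second, fitting a step would be overconfident about a boundary the data itself does not respect.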
Again, you're talking about a different "sharpness" than the one in the article I cited.
Dudes, can you please read the paper before commenting?
Surprisingly, this time it was outrun who came closest to getting the point.