More precisely: at the training stage, I would draw a minibatch of pictures and transform each of them by translating it by a vector in the plane and rotating it. I would then evaluate the loss as a function of the three transformation parameters (two translation components and the rotation angle) and look for its minimum. It cannot be just any minimum, though: it needs to sit in a smooth, broad basin rather than a sharp dip. The procedure sounds cumbersome, but there are many statistical methods that could accelerate it (random subsampling of the transformation grid, for instance). There's a risk that the network would start to confuse objects that come to resemble different objects once rotated (a 6 and a 9, say), but there's a chance that the network would reach for finer differences and still train itself successfully.
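A minimal sketch of what I have in mind, in PyTorch; the grid of candidate transformations, the smoothness penalty, and all the helper names are my own placeholder choices, not a tested recipe:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def transformed_losses(model, images, labels, angles, shifts):
    """Minibatch loss over a grid of (rotation, translation) parameters.

    Returns a 1-D tensor with one loss value per candidate transformation.
    """
    losses = []
    for angle in angles:
        for dx, dy in shifts:
            # Translate by (dx, dy) pixels and rotate by `angle` degrees.
            aug = TF.affine(images, angle=angle, translate=[dx, dy],
                            scale=1.0, shear=[0.0])
            losses.append(F.cross_entropy(model(aug), labels))
    return torch.stack(losses)

def smooth_minimum_loss(losses, sharpness_weight=0.1):
    """Prefer a broad basin: the raw minimum plus a penalty for how far
    it sits below the mean over the grid (a crude proxy for 'smooth
    basin vs. sharp dip')."""
    min_loss = losses.min()
    sharpness = losses.mean() - min_loss  # large when the dip is sharp
    return min_loss + sharpness_weight * sharpness
```

In the training loop, each minibatch would be scored with `smooth_minimum_loss` and backpropagated as usual; subsampling the `(angles, shifts)` grid at each step is the kind of statistical shortcut mentioned above.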
Apart from translation and rotation, one could consider other transformations, such as asymmetric scaling to manipulate the perspective. This would be much more challenging, but it could potentially teach the network to recognise the same object seen from different angles.
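For the scaling variant, a sampling-grid warp is one way to sketch the idea; `asymmetric_scale`, `sx` and `sy` are hypothetical names, and a genuine change of viewpoint would need a full perspective homography rather than this axis-aligned stretch:

```python
import torch
import torch.nn.functional as F

def asymmetric_scale(images, sx, sy):
    """Stretch a batch (B, C, H, W) by independent factors along x and y.

    Uses an affine sampling grid; the grid encodes the inverse map, so
    dividing by the factors enlarges the image content by (sx, sy).
    """
    b = images.size(0)
    theta = torch.tensor([[1.0 / sx, 0.0, 0.0],
                          [0.0, 1.0 / sy, 0.0]],
                         dtype=images.dtype, device=images.device)
    theta = theta.unsqueeze(0).expand(b, -1, -1)
    grid = F.affine_grid(theta, list(images.shape), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)
```

Plugged into the same loss-grid machinery as above, this would add two more parameters to the search space, which is presumably where much of the extra cost would come from.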
Anyway, I would try this out myself, but I don't have sufficient computational power.

(If I had, I would use it for something more interesting anyway.)