This concludes our basic introduction to transformers, which aspired to be mathematically precise and to provide intuitions behind the design decisions. We have not talked about loss functions or training in any detail, because rather standard deep learning approaches are used for these. Briefly, transformers are typically trained using the Adam optimiser. They are often slow to train compared to other architectures, and training typically becomes more unstable as it progresses. Gradient clipping, decaying learning rate schedules, and increasing the batch size through training help to mitigate these instabilities, but often they still persist.
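As an illustrative sketch only, the snippet below shows how these ingredients (the Adam optimiser, gradient clipping, and a decaying learning rate schedule) might be combined in a PyTorch training loop. The toy model, random data, and all hyperparameters are hypothetical placeholders rather than recommendations, and the batch-size increase is omitted for brevity.

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming a toy sequence-to-sequence task with random data.
# All sizes and hyperparameters below are illustrative placeholders.
vocab_size, d_model, seq_len, batch_size, num_steps = 1000, 64, 16, 32, 100

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)
readout = nn.Linear(d_model, vocab_size)
params = (list(embed.parameters()) + list(transformer.parameters())
          + list(readout.parameters()))

# Adam optimiser, as is typical for transformers.
optimiser = torch.optim.Adam(params, lr=3e-4)
# A decaying learning rate schedule (cosine decay here; one of many choices).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=num_steps)
loss_fn = nn.CrossEntropyLoss()

for step in range(num_steps):
    src = torch.randint(vocab_size, (batch_size, seq_len))  # stand-in inputs
    tgt = torch.randint(vocab_size, (batch_size, seq_len))  # stand-in targets
    logits = readout(transformer(embed(src), embed(tgt)))
    loss = loss_fn(logits.reshape(-1, vocab_size), tgt.reshape(-1))

    optimiser.zero_grad()
    loss.backward()
    # Gradient clipping: rescale gradients so their global norm is at most 1.
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimiser.step()
    scheduler.step()  # decay the learning rate after each step
```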