The encoder in a Transformer processes the input sequence and computes query, key, and value vectors for each token. Attention weights are computed from the query and key vectors by taking their dot product and scaling the result, and the output is then the weighted sum of the value vectors, with the attention weights serving as the weights. This computation is carried out in parallel across several independent attention heads within each layer, which is what gives rise to the term multi-head attention, and the layers are stacked to learn increasingly complex representations of the input. Combining the heads' results into a final score allows the Transformer to encode multiple contextual relationships for each word in a sequence.
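The following is a minimal sketch of scaled dot-product attention for a single head, written in NumPy; the shapes, variable names, and the toy projection matrices are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dot product of queries and keys, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)    # attention weights over the sequence
    return weights @ V                    # weighted sum of the value vectors

# Toy example: 4 tokens, 16-dimensional embeddings, 8-dimensional head.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))              # token embeddings (hypothetical)
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                          # (4, 8)
```

In multi-head attention, this same computation runs in parallel with separate projection matrices per head, and the heads' outputs are concatenated and projected back to the model dimension.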
In spite of all this, because GPTs are trained on large datasets, biases present in the training data are reflected in the generated text. Since the model is generative in nature, it can also produce inappropriate content, as it lacks an understanding of the true meaning of the context. Limited long-term memory is another drawback: unlike humans, these models struggle to maintain coherence and consistency in longer pieces of text or over multiple exchanges in a conversation.
(C) ChatGPT
To rectify these shortcomings, OpenAI introduced a twist: incorporating human feedback into the training process so that the GPT-3 model's output better matches user intent. This technique is called Reinforcement Learning from Human Feedback (RLHF), and it is explained in detail in OpenAI's 2022 paper titled "Training language models to follow instructions with human feedback".
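One core ingredient of RLHF pipelines of this kind is a reward model trained on human preference comparisons, so that responses labelers preferred receive higher scores than rejected ones. The snippet below is a hedged, highly simplified sketch of that pairwise comparison loss; the scalar rewards stand in for a real reward model's outputs, and the function name is hypothetical.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Toy rewards for a (preferred, rejected) response pair.
print(pairwise_reward_loss(r_chosen=1.2, r_rejected=-0.3))  # small loss: ordering is correct
print(pairwise_reward_loss(r_chosen=-0.5, r_rejected=0.8))  # larger loss: ordering is wrong
```

The trained reward model is then used as the optimization signal when fine-tuning the language model with reinforcement learning, steering its outputs toward responses humans rate as more helpful.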