1 Attention is all you need
Transformers replace recurrence with self-attention: every token looks at every other token and decides how much to focus on each. The heatmap below shows the attention weights between three tokens — change their values and watch the focus shift.