🤖

Transformer Networks Advanced

The architecture behind modern LLMs. Visualise self-attention weights between tokens and implement scaled dot-product attention in JS.

4 lessons 12 quiz questions ⚡ Live compiler

Lessons & quizzes Compiler Certificate

📚 Lessons & quizzes

Each lesson ends with its own short quiz. Answer them as you go — score 90% across all lessons to earn your certificate.

1 Attention is all you need

Transformers replace recurrence with self-attention: every token looks at every other token and decides how much to focus on each. The heatmap below shows the attention weights between three tokens — change their values and watch the focus shift.

2 Query, Key, Value

Each token is projected into a Query, Key and Value. Attention scores are softmax(Q·Kᵀ / √d); the output for a token is the weighted sum of all Values. The √d scaling keeps the scores from growing too large.

function softmax(v){const m=Math.max(...v);const e=v.map(x=>Math.exp(x-m));const s=e.reduce((a,b)=>a+b);return e.map(x=>x/s);}
const dot=(a,b)=>a.reduce((s,x,i)=>s+x*b[i],0);
function attention(q, K){ const d=q.length; return softmax(K.map(k=>dot(q,k)/Math.sqrt(d))); }
console.log(attention([1,0],[[1,0],[0,1],[1,1]]));

3 Heads, position and the stack

Real Transformers use multi-head attention (several attention patterns in parallel), add positional encodings so order is known, and stack many blocks with feed-forward sub-layers, residual connections and normalisation.

4 Why Transformers took over

Because attention processes a whole sequence in parallel and models long-range relationships directly, Transformers scale to huge datasets and parameter counts — the foundation of modern large language models, vision Transformers and multimodal systems.

⚡ Transformer Networks Compiler

Code runs live in a sandboxed frame, rendered by your browser.

🎓 Certificate of Completion

🔒 Complete every lesson quiz above with 90%+ to unlock your downloadable certificate.