Paper Notes: Attention Is All You Need


Paper source: Attention Is All You Need

Model Architecture

(Figure: the Transformer model architecture)

Scaled Dot-Product Attention

\[\mathrm {Attention}(Q, K, V) = \mathrm {softmax}(\frac {QK^T}{\sqrt {d_k}})V\]

Reference link
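
A minimal NumPy sketch of the formula above, to make the shapes concrete. The function and variable names are my own, and masking and dropout from the full Transformer are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., n_q, n_k)
    weights = softmax(scores, axis=-1)                   # attention weights sum to 1 per query
    return weights @ V                                   # (..., n_q, d_v)

if __name__ == "__main__":
    Q = np.random.randn(2, 5, 64)   # (batch, query length, d_k)
    K = np.random.randn(2, 7, 64)   # (batch, key length, d_k)
    V = np.random.randn(2, 7, 64)   # (batch, key length, d_v)
    print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 5, 64)
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.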

Multi-Head Attention

\[\mathrm {MultiHead}(Q, K, V) = \mathrm {Concat}(\mathrm {head_1}, \ldots, \mathrm {head_h})W^O\]

where $\mathrm {head_i} = \mathrm {Attention}(QW_i^Q, KW_i^K, VW_i^V)$
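
A sketch of multi-head attention built on the scaled dot-product function above. The per-head projections are passed in as plain lists of matrices (my own convention, not the paper's notation):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: length-h lists of (d_model, d_k) / (d_model, d_v) projections.
    # W_O: (h * d_v, d_model) output projection.
    # Reuses scaled_dot_product_attention from the sketch above.
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    # Concatenate the per-head outputs along the feature axis, then project back.
    return np.concatenate(heads, axis=-1) @ W_O

if __name__ == "__main__":
    h, d_model, d_k = 8, 512, 64                      # d_k = d_v = d_model / h, as in the paper
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2, 10, d_model))          # self-attention: Q = K = V = X
    W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
    W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
    W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
    W_O = rng.standard_normal((h * d_k, d_model))
    print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O).shape)  # (2, 10, 512)
```

In the paper, $h = 8$ and $d_k = d_v = d_{model}/h = 64$, so the concatenated heads have the same width as the model dimension before the $W^O$ projection.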