Paper source: Attention Is All You Need
Model Architecture
Scaled Dot-Product Attention
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
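A minimal NumPy sketch of the formula above, for a single (unbatched) query/key/value set; the function names and shapes are my own illustration, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k), scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 over keys
    return weights @ V                   # (seq_q, d_v)
```

The division by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.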
Multi-Head Attention
\[\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O\]
where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
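Continuing the sketch above (and reusing `scaled_dot_product_attention` and `np` from it), a hedged illustration of multi-head attention with per-head projection matrices passed in as plain lists; the parameter layout is an assumption for clarity, not the paper's exact implementation.

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k: lists of h matrices of shape (d_model, d_k)
    # W_v:      list of h matrices of shape (d_model, d_v)
    # W_o:      output projection of shape (h * d_v, d_model)
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)  # each: (seq_q, d_v)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the heads along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage with hypothetical sizes: d_model=8, h=2 heads, d_k=d_v=4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, d_model=8
W_q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_k = [rng.normal(size=(8, 4)) for _ in range(2)]
W_v = [rng.normal(size=(8, 4)) for _ in range(2)]
W_o = rng.normal(size=(2 * 4, 8))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: (5, 8)
```

Each head attends with its own learned projections, so different heads can pick up different relations between positions; concatenating them and applying $W^O$ maps the result back to the model dimension.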