Paper: Bidirectional Attention Flow for Machine Comprehension
Introduction
SQuAD dataset
The Stanford Question Answering Dataset
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Model
Character Embedding Layer
Illustration (source: link)
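This layer maps each word to a vector with a character-level CNN (Kim, 2014): embed the characters, convolve, and max-pool over the character positions. A minimal PyTorch sketch; the vocabulary size, embedding width, filter count, and kernel width below are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Character-level word embedding: embed chars, 1D-convolve, max-pool over time."""
    def __init__(self, n_chars=100, char_dim=16, out_dim=100, kernel_size=5):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids):                      # (batch, seq_len, word_len)
        b, t, w = char_ids.shape
        x = self.char_emb(char_ids)                   # (b, t, w, char_dim)
        x = x.view(b * t, w, -1).transpose(1, 2)      # (b*t, char_dim, w)
        x = torch.relu(self.conv(x))                  # (b*t, out_dim, w)
        x = x.max(dim=2).values                       # max-pool over character positions
        return x.view(b, t, -1)                       # (b, t, out_dim)
```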
Word Embedding Layer
GloVe (Pennington et al., 2014)
After obtaining the two levels of embeddings, we concatenate them and feed the result into a highway network to obtain the final embedding sequence.
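A sketch of this fusion step, assuming 100-dimensional word and character vectors and a single highway layer (the paper uses a two-layer highway network); the sizes are illustrative:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = g * relu(W1 x) + (1 - g) * x, with gate g = sigmoid(W2 x)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

word_emb = torch.randn(1, 30, 100)              # (batch, T, 100) GloVe vectors
char_emb = torch.randn(1, 30, 100)              # (batch, T, 100) from the char CNN
x = torch.cat([word_emb, char_emb], dim=-1)     # (batch, T, 200)
x = Highway(200)(x)                             # final embedding sequence
```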
Contextual Embedding Layer
Both the context word sequence and the query word sequence are passed through a bi-directional LSTM, and the outputs of the two directions are concatenated, giving:
\[\mathbf H \in \mathbb R^{2d\times T}, \quad \mathbf U \in \mathbb R^{2d\times J}\]
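A sketch of this step, continuing from the highway output; the sizes and the choice to share one bi-LSTM between context and query are assumptions:

```python
import torch
import torch.nn as nn

emb_dim, d = 200, 100                       # illustrative sizes
ctx_lstm = nn.LSTM(emb_dim, d, bidirectional=True, batch_first=True)

context_emb = torch.randn(1, 30, emb_dim)   # (batch, T, emb_dim) from the highway network
query_emb = torch.randn(1, 10, emb_dim)     # (batch, J, emb_dim)

H, _ = ctx_lstm(context_emb)                # (batch, T, 2d)
U, _ = ctx_lstm(query_emb)                  # (batch, J, 2d); sharing the LSTM is a choice here
```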
Attention Flow Layer
Similarity Matrix $\mathbf S\in \mathbb R^{T\times J}$
\[\mathbf S_{tj} = \alpha(\mathbf H_{:t}, \mathbf U_{:j})\in \mathbb R\] \[\alpha(\mathbf{h,u}) = \mathbf{w^T[h;u;h\circ u]}\]
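A sketch of the trilinear similarity $\alpha$ on top of $\mathbf H$ and $\mathbf U$ from the previous sketch (batch-first tensors; the class name and sizes are illustrative). The concatenation $[\mathbf h;\mathbf u;\mathbf h\circ\mathbf u]$ has length $6d$, so $\mathbf w\in\mathbb R^{6d}$:

```python
import torch
import torch.nn as nn

class Similarity(nn.Module):
    """S[t, j] = w^T [h_t; u_j; h_t * u_j]  (trilinear form, w in R^{6d})."""
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):                       # H: (B, T, 2d), U: (B, J, 2d)
        T, J = H.size(1), U.size(1)
        Hx = H.unsqueeze(2).expand(-1, -1, J, -1)  # (B, T, J, 2d)
        Ux = U.unsqueeze(1).expand(-1, T, -1, -1)  # (B, T, J, 2d)
        cat = torch.cat([Hx, Ux, Hx * Ux], dim=-1) # (B, T, J, 6d)
        return self.w(cat).squeeze(-1)             # (B, T, J)

# S = Similarity(d=100)(H, U)   # (B, T, J), using H, U from the previous sketch
```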
Context-to-query Attention
attention weight $\mathbf a_t\in \mathbb R^J$
\[\mathbf a_t = \mathrm {softmax}(\mathbf S_{t:})\]
attended query vector $\tilde {\mathbf U}_{:t}$
\[\tilde {\mathbf U}_{:t} = \sum_j \mathbf a_{tj}\mathbf U_{:j}\]
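In tensor form this is a row-wise softmax of $\mathbf S$ followed by a weighted sum of the query columns; a sketch with stand-in tensors:

```python
import torch

B, T, J, d = 1, 30, 10, 100
S = torch.randn(B, T, J)             # similarity matrix from the previous sketch
U = torch.randn(B, J, 2 * d)         # contextual query embeddings

a = torch.softmax(S, dim=2)          # a_t = softmax(S_{t,:}) over query positions
U_tilde = torch.bmm(a, U)            # (B, T, 2d): attended query vector per context word
```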
Query-to-context Attention
attention weight $\mathbf b\in \mathbb R^T$
\[\mathbf b = \mathrm {softmax}\Big(\big[\max_{j}\mathbf S_{tj}\big]_{t=1}^{T}\Big)\in \mathbb R^T\]
attended context vector $\tilde {\mathbf h} = \sum_t \mathbf b_t \mathbf H_{:t}$
$\tilde {\mathbf h}$ is tiled $T$ times across the columns to obtain:
\[\tilde {\mathbf H}\in \mathbb R^{2d\times T}\]
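A sketch of the query-to-context step with stand-in tensors (max over the query axis, softmax over the context axis, then a weighted sum and tiling):

```python
import torch

B, T, d = 1, 30, 100
S = torch.randn(B, T, 10)            # similarity matrix (B, T, J)
H = torch.randn(B, T, 2 * d)         # contextual context embeddings

b = torch.softmax(S.max(dim=2).values, dim=1)      # (B, T): softmax over max_j S_{tj}
h_tilde = torch.bmm(b.unsqueeze(1), H)             # (B, 1, 2d): attended context vector
H_tilde = h_tilde.expand(-1, T, -1)                # tiled T times -> (B, T, 2d)
```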
Query-aware representation of each context word $\mathbf G_{:t}$
\[\mathbf G_{:t} = \beta(\mathbf H_{:t}, \tilde {\mathbf U}_{:t}, \tilde {\mathbf H}_{:t})\in \mathbb R^{d_G}\] \[\beta(\mathbf{h,\tilde u, \tilde h}) = [\mathbf{h;\tilde u;h\circ \tilde u;h \circ \tilde h}]\in \mathbb R^{8d}, \quad d_G = 8d\]
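The merge $\beta$ is just concatenation with two elementwise products; continuing the stand-in tensors from the previous sketches:

```python
import torch

B, T, d = 1, 30, 100
H = torch.randn(B, T, 2 * d)          # contextual context embeddings
U_tilde = torch.randn(B, T, 2 * d)    # C2Q attended query vectors
H_tilde = torch.randn(B, T, 2 * d)    # Q2C attended context vector, tiled over T

G = torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)   # (B, T, 8d)
```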
Modeling Layer
Feed $\mathbf G$ into a bi-LSTM to obtain $\mathbf M \in \mathbb R^{2d\times T}$. Each column of $\mathbf M$ can be viewed as the representation of a context word after taking both its surrounding context and the query into account.
Output Layer
Start
\[\mathbf p^1 = \mathrm {softmax}(\mathbf {w_{p^1}^T[G;M]}), \quad \mathbf {w_{p^1}}\in \mathbb R^{10d}\]
End
Feed $\mathbf M$ into another bi-LSTM to obtain $\mathbf M^2 \in \mathbb R^{2d\times T}$, then
\[\mathbf p^2 = \mathrm {softmax}(\mathbf {w_{p^2}^T[G;M^2]}), \quad \mathbf {w_{p^2}}\in \mathbb R^{10d}\]
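A combined sketch of the modeling and output layers, continuing from $\mathbf G$ above; the paper stacks two bi-LSTM layers in the modeling layer, while a single layer per step is used here for brevity:

```python
import torch
import torch.nn as nn

B, T, d = 1, 30, 100
G = torch.randn(B, T, 8 * d)                            # query-aware context representation

# Modeling layer: bi-LSTM over G gives M (B, T, 2d)
model_lstm = nn.LSTM(8 * d, d, bidirectional=True, batch_first=True)
M, _ = model_lstm(G)

# Start distribution: p1 = softmax(w_{p1}^T [G; M]), with w_{p1} in R^{10d}
w_p1 = nn.Linear(10 * d, 1, bias=False)
p1 = torch.softmax(w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)   # (B, T)

# End distribution: run M through another bi-LSTM to get M2, then p2 = softmax(w_{p2}^T [G; M2])
end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
M2, _ = end_lstm(M)
w_p2 = nn.Linear(10 * d, 1, bias=False)
p2 = torch.softmax(w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)  # (B, T)
```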
Loss
Training:
\[L(\theta) = -\frac 1 N \sum_{i=1}^{N}\Big[\log(\mathbf p_{y_i^1}^1) + \log(\mathbf p_{y_i^2}^2)\Big]\]
$\mathbf p_k$ denotes the probability assigned to the $k$-th position.
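Written directly against the two distributions; here p1 and p2 stand in for the output-layer distributions and y1, y2 for the gold start/end indices:

```python
import torch

B, T = 4, 30
p1 = torch.softmax(torch.randn(B, T), dim=1)   # predicted start distribution per example
p2 = torch.softmax(torch.randn(B, T), dim=1)   # predicted end distribution per example
y1 = torch.randint(0, T, (B,))                 # gold start indices
y2 = torch.randint(0, T, (B,))                 # gold end indices

# Negative log-likelihood of the gold start and end positions, averaged over the batch
loss = -(torch.log(p1[torch.arange(B), y1]) + torch.log(p2[torch.arange(B), y2])).mean()
```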
Test:
Answer span $(k, l)$, where $k\leq l$: choose the span whose probability product $\mathbf p_k^1\mathbf p_l^2$ is largest.
In principle this search can be done in linear time with dynamic programming.
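A sketch of that linear-time search: while scanning the end position $l$, keep the running argmax of $\mathbf p^1$ over $k\leq l$, so each step costs $O(1)$ (the tensors are random stand-ins for the model output):

```python
import torch

T = 30
p1 = torch.softmax(torch.randn(T), dim=0)   # start probabilities over context positions
p2 = torch.softmax(torch.randn(T), dim=0)   # end probabilities over context positions

# For each end position l, the best start k <= l only depends on the running argmax of p1,
# so a single left-to-right scan finds argmax_{k <= l} p1[k] * p2[l] in O(T).
best_prob, best_span = -1.0, (0, 0)
running_k = 0
for l in range(T):
    if p1[l] > p1[running_k]:
        running_k = l
    prob = (p1[running_k] * p2[l]).item()
    if prob > best_prob:
        best_prob, best_span = prob, (running_k, l)

print(best_span, best_prob)
```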