Paper: Bidirectional Attention Flow for Machine Comprehension
Introduction
SQuAD dataset
The Stanford Question Answering Dataset
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Model
Character Embedding Layer
Illustration (source: link)
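This layer maps each word to a vector with a character-level CNN (Kim, 2014): embed the characters, convolve, and max-pool over the character positions. A minimal PyTorch sketch; the vocabulary size, embedding width, filter count, and kernel width below are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Character-level word embedding: embed chars, 1D-convolve, max-pool over time."""
    def __init__(self, n_chars=100, char_dim=16, out_dim=100, kernel_size=5):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids):                      # (batch, seq_len, word_len)
        b, t, w = char_ids.shape
        x = self.char_emb(char_ids)                   # (b, t, w, char_dim)
        x = x.view(b * t, w, -1).transpose(1, 2)      # (b*t, char_dim, w)
        x = torch.relu(self.conv(x))                  # (b*t, out_dim, w)
        x = x.max(dim=2).values                       # max-pool over character positions
        return x.view(b, t, -1)                       # (b, t, out_dim)
```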
Word Embedding Layer
GloVe (Pennington et al., 2014)
After obtaining the two levels of embeddings, we concatenate them and feed the result into a highway network to obtain the final embedding sequence.
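A sketch of this fusion step, assuming 100-dimensional word and character vectors and a single highway layer (the paper uses a two-layer highway network); the sizes are illustrative:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = g * relu(W1 x) + (1 - g) * x, with gate g = sigmoid(W2 x)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

word_emb = torch.randn(1, 30, 100)              # (batch, T, 100) GloVe vectors
char_emb = torch.randn(1, 30, 100)              # (batch, T, 100) from the char CNN
x = torch.cat([word_emb, char_emb], dim=-1)     # (batch, T, 200)
x = Highway(200)(x)                             # final embedding sequence
```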
Contextual Embedding Layer
Both the context word sequence and the query word sequence are passed through a bi-directional LSTM, and the outputs of the two directions are concatenated, giving:
\[\mathbf H \in \mathbb R^{2d\times T}, \quad \mathbf U \in \mathbb R^{2d\times J}\]
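A sketch of this step, continuing from the highway output; the sizes and the choice to share one bi-LSTM between context and query are assumptions:

```python
import torch
import torch.nn as nn

emb_dim, d = 200, 100                       # illustrative sizes
ctx_lstm = nn.LSTM(emb_dim, d, bidirectional=True, batch_first=True)

context_emb = torch.randn(1, 30, emb_dim)   # (batch, T, emb_dim) from the highway network
query_emb = torch.randn(1, 10, emb_dim)     # (batch, J, emb_dim)

H, _ = ctx_lstm(context_emb)                # (batch, T, 2d)
U, _ = ctx_lstm(query_emb)                  # (batch, J, 2d); sharing the LSTM is a choice here
```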
Attention Flow Layer
Similarity Matrix $\mathbf S\in \mathbb R^{T\times J}$
\[\mathbf S_{tj} = \alpha(\mathbf H_{:t}, \mathbf U_{:j})\in \mathbb R\] \[\alpha(\mathbf{h,u}) = \mathbf{w^T[h;u;h\circ u]}\]
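A sketch of the trilinear similarity $\alpha$ on top of $\mathbf H$ and $\mathbf U$ from the previous sketch (batch-first tensors; the class name and sizes are illustrative). The concatenation $[\mathbf h;\mathbf u;\mathbf h\circ\mathbf u]$ has length $6d$, so $\mathbf w\in\mathbb R^{6d}$:

```python
import torch
import torch.nn as nn

class Similarity(nn.Module):
    """S[t, j] = w^T [h_t; u_j; h_t * u_j]  (trilinear form, w in R^{6d})."""
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):                       # H: (B, T, 2d), U: (B, J, 2d)
        T, J = H.size(1), U.size(1)
        Hx = H.unsqueeze(2).expand(-1, -1, J, -1)  # (B, T, J, 2d)
        Ux = U.unsqueeze(1).expand(-1, T, -1, -1)  # (B, T, J, 2d)
        cat = torch.cat([Hx, Ux, Hx * Ux], dim=-1) # (B, T, J, 6d)
        return self.w(cat).squeeze(-1)             # (B, T, J)

# S = Similarity(d=100)(H, U)   # (B, T, J), using H, U from the previous sketch
```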
Context-to-query Attention
attention weight $\mathbf a_t\in \mathbb R^J$
\[\mathbf a_t = \mathrm {softmax}(\mathbf S_{t:})\]
attended query vector $\tilde {\mathbf U}_{:t}$
\[\tilde {\mathbf U}_{:t} = \sum_j \mathbf a_{tj}\mathbf U_{:j}\]
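In tensor form this is a row-wise softmax of $\mathbf S$ followed by a weighted sum of the query columns; a sketch with stand-in tensors:

```python
import torch

B, T, J, d = 1, 30, 10, 100
S = torch.randn(B, T, J)             # similarity matrix from the previous sketch
U = torch.randn(B, J, 2 * d)         # contextual query embeddings

a = torch.softmax(S, dim=2)          # a_t = softmax(S_{t,:}) over query positions
U_tilde = torch.bmm(a, U)            # (B, T, 2d): attended query vector per context word
```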
Query-to-context Attention
attention weight $\mathbf b\in \mathbb R^T$
\[\mathbf b = \mathrm {softmax}\Big(\big[\max_{j}\mathbf S_{tj}\big]_{t=1}^{T}\Big)\in \mathbb R^T\]
attended context vector $\tilde {\mathbf h} = \sum_t \mathbf b_t \mathbf H_{:t}$
$\tilde {\mathbf h}$ is tiled $T$ times across the columns to obtain:
\[\tilde {\mathbf H}\in \mathbb R^{2d\times T}\]
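A sketch of the query-to-context step with stand-in tensors (max over the query axis, softmax over the context axis, then a weighted sum and tiling):

```python
import torch

B, T, d = 1, 30, 100
S = torch.randn(B, T, 10)            # similarity matrix (B, T, J)
H = torch.randn(B, T, 2 * d)         # contextual context embeddings

b = torch.softmax(S.max(dim=2).values, dim=1)      # (B, T): softmax over max_j S_{tj}
h_tilde = torch.bmm(b.unsqueeze(1), H)             # (B, 1, 2d): attended context vector
H_tilde = h_tilde.expand(-1, T, -1)                # tiled T times -> (B, T, 2d)
```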
Query-aware representation of each context word $\mathbf G_{:t}$
\[\mathbf G_{:t} = \beta(\mathbf H_{:t}, \tilde {\mathbf U}_{:t}, \tilde {\mathbf H}_{:t})\in \mathbb R^{d_G}\] \[\beta(\mathbf{h,\tilde u, \tilde h}) = [\mathbf{h;\tilde u;h\circ \tilde u;h \circ \tilde h}]\in \mathbb R^{8d}, \quad d_G = 8d\]
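The merge $\beta$ is just concatenation with two elementwise products; continuing the stand-in tensors from the previous sketches:

```python
import torch

B, T, d = 1, 30, 100
H = torch.randn(B, T, 2 * d)          # contextual context embeddings
U_tilde = torch.randn(B, T, 2 * d)    # C2Q attended query vectors
H_tilde = torch.randn(B, T, 2 * d)    # Q2C attended context vector, tiled over T

G = torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)   # (B, T, 8d)
```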
Modeling Layer
Feed $\mathbf G$ into a bi-LSTM to obtain $\mathbf M \in \mathbb R^{2d\times T}$. Each column of $\mathbf M$ can be viewed as the representation of a context word after taking both its surrounding context and the query into account.
Output Layer
Start
\[\mathbf p^1 = \mathrm {softmax}(\mathbf {w_{p^1}^T[G;M]}), \quad \mathbf {w_{p^1}}\in \mathbb R^{10d}\]
End
Feed $\mathbf M$ into another bi-LSTM to obtain $\mathbf M^2 \in \mathbb R^{2d\times T}$, then
\[\mathbf p^2 = \mathrm {softmax}(\mathbf {w_{p^2}^T[G;M^2]}), \quad \mathbf {w_{p^2}}\in \mathbb R^{10d}\]
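A combined sketch of the modeling and output layers, continuing from $\mathbf G$ above; the paper stacks two bi-LSTM layers in the modeling layer, while a single layer per step is used here for brevity:

```python
import torch
import torch.nn as nn

B, T, d = 1, 30, 100
G = torch.randn(B, T, 8 * d)                            # query-aware context representation

# Modeling layer: bi-LSTM over G gives M (B, T, 2d)
model_lstm = nn.LSTM(8 * d, d, bidirectional=True, batch_first=True)
M, _ = model_lstm(G)

# Start distribution: p1 = softmax(w_{p1}^T [G; M]), with w_{p1} in R^{10d}
w_p1 = nn.Linear(10 * d, 1, bias=False)
p1 = torch.softmax(w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)   # (B, T)

# End distribution: run M through another bi-LSTM to get M2, then p2 = softmax(w_{p2}^T [G; M2])
end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
M2, _ = end_lstm(M)
w_p2 = nn.Linear(10 * d, 1, bias=False)
p2 = torch.softmax(w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)  # (B, T)
```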
Loss
Training:
\[L(\theta) = -\frac 1 N \sum_{i=1}^{N}\Big[\log(\mathbf p_{y_i^1}^1) + \log(\mathbf p_{y_i^2}^2)\Big]\]
$\mathbf p_k$ denotes the probability assigned to the $k$-th position.
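Written directly against the two distributions; here p1 and p2 stand in for the output-layer distributions and y1, y2 for the gold start/end indices:

```python
import torch

B, T = 4, 30
p1 = torch.softmax(torch.randn(B, T), dim=1)   # predicted start distribution per example
p2 = torch.softmax(torch.randn(B, T), dim=1)   # predicted end distribution per example
y1 = torch.randint(0, T, (B,))                 # gold start indices
y2 = torch.randint(0, T, (B,))                 # gold end indices

# Negative log-likelihood of the gold start and end positions, averaged over the batch
loss = -(torch.log(p1[torch.arange(B), y1]) + torch.log(p2[torch.arange(B), y2])).mean()
```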
Test:
Answer span $(k, l)$, where $k\leq l$: choose the span whose probability product $\mathbf p_k^1\mathbf p_l^2$ is largest.
In principle this search can be done in linear time with dynamic programming.
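A sketch of that linear-time search: while scanning the end position $l$, keep the running argmax of $\mathbf p^1$ over $k\leq l$, so each step costs $O(1)$ (the tensors are random stand-ins for the model output):

```python
import torch

T = 30
p1 = torch.softmax(torch.randn(T), dim=0)   # start probabilities over context positions
p2 = torch.softmax(torch.randn(T), dim=0)   # end probabilities over context positions

# For each end position l, the best start k <= l only depends on the running argmax of p1,
# so a single left-to-right scan finds argmax_{k <= l} p1[k] * p2[l] in O(T).
best_prob, best_span = -1.0, (0, 0)
running_k = 0
for l in range(T):
    if p1[l] > p1[running_k]:
        running_k = l
    prob = (p1[running_k] * p2[l]).item()
    if prob > best_prob:
        best_prob, best_span = prob, (running_k, l)

print(best_span, best_prob)
```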