Paper Notes: Bidirectional Attention Flow for Machine Comprehension


Paper: Bidirectional Attention Flow for Machine Comprehension

Introduction

SQuAD dataset

The Stanford Question Answering Dataset

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.


Model


Character Embedding Layer

Diagram (source: link)


Word Embedding Layer

GloVe (Pennington et al., 2014)

After obtaining the two levels of embeddings, they are concatenated and fed into a highway network to obtain the final embedding sequence.
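
A minimal NumPy sketch of this step, under assumed dimensions; `char_emb`, `word_emb`, and the highway weights are made-up stand-ins, and only a single highway layer is shown:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, d = 10, 100                               # assumed: 10 context words, embedding size d = 100
char_emb = np.random.randn(d, T)             # character-level embeddings (stand-in)
word_emb = np.random.randn(d, T)             # word-level embeddings, e.g. GloVe (stand-in)

x = np.concatenate([char_emb, word_emb], axis=0)      # (2d, T): concatenation of both levels

# One highway layer: a gate t mixes a transformed input with the input itself
W_H, b_H = np.random.randn(2*d, 2*d), np.zeros((2*d, 1))
W_T, b_T = np.random.randn(2*d, 2*d), np.zeros((2*d, 1))
t = sigmoid(W_T @ x + b_T)                            # transform gate, (2d, T)
X = t * np.maximum(W_H @ x + b_H, 0) + (1 - t) * x    # final embedding sequence, (2d, T)
```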

Contextual Embedding Layer

Both the context word sequence and the query word sequence are passed through a bi-directional LSTM, and the outputs of the two directions are concatenated, yielding (with $T$ the context length and $J$ the query length):

\[\mathbf H \in \mathbb R^{2d\times T}\] \[\mathbf U \in \mathbb R^{2d\times J}\]
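
A PyTorch sketch of this layer (not the paper's code; dimensions, batch size, and weight sharing between context and query are assumptions):

```python
import torch
import torch.nn as nn

d, T, J = 100, 10, 5                      # assumed hidden size, context length, query length
lstm = nn.LSTM(input_size=2*d, hidden_size=d, bidirectional=True, batch_first=True)

X = torch.randn(1, T, 2*d)                # context embedding sequence from the previous layer
Q = torch.randn(1, J, 2*d)                # query embedding sequence from the previous layer

H, _ = lstm(X)                            # (1, T, 2d): forward/backward outputs concatenated
U, _ = lstm(Q)                            # (1, J, 2d): the same LSTM is reused here for brevity
H = H.squeeze(0).T                        # (2d, T), matching the notation above
U = U.squeeze(0).T                        # (2d, J)
```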

Attention Flow Layer

Similarity Matrix $\mathbf S\in \mathbb R^{T\times J}$

\[\mathbf S_{tj} = \alpha(\mathbf H_{:t}, \mathbf U_{:j})\in \mathbb R\] \[\alpha(\mathbf{h,u}) = \mathbf{w^T[h;u;h\circ u]}\]
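
A NumPy sketch of the similarity matrix, with random stand-ins for $\mathbf H$, $\mathbf U$, and the trainable weight $\mathbf w \in \mathbb R^{6d}$:

```python
import numpy as np

d, T, J = 100, 10, 5
H = np.random.randn(2*d, T)            # contextual embeddings of the context
U = np.random.randn(2*d, J)            # contextual embeddings of the query
w = np.random.randn(6*d)               # trainable weight of alpha (stand-in)

S = np.empty((T, J))
for t in range(T):
    for j in range(J):
        h, u = H[:, t], U[:, j]
        S[t, j] = w @ np.concatenate([h, u, h * u])   # alpha(h, u) = w^T [h; u; h∘u]
```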

Context-to-query Attention


  • attention weight $\mathbf a_t\in \mathbb R^J$

    \[\mathbf a_t = \mathrm {softmax}(\mathbf S_{t:})\]
  • attended query vector $\tilde {\mathbf U}_{:t}$

    \[\tilde {\mathbf U}_{:t} = \sum_j \mathbf a_{tj}\mathbf U_{:j}\]
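
A self-contained NumPy sketch of context-to-query attention (`S` and `U` are random stand-ins for the matrices defined above):

```python
import numpy as np

d, T, J = 100, 10, 5
S = np.random.randn(T, J)                  # similarity matrix (stand-in)
U = np.random.randn(2*d, J)                # contextual query embeddings (stand-in)

A = np.exp(S - S.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)       # row t of A is a_t = softmax(S_{t:})

U_tilde = U @ A.T                          # (2d, T): column t equals sum_j a_{tj} U_{:j}
```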

Query-to-context Attention


  • attention weight $\mathbf b\in \mathbb R^T$

    \[\mathbf b = \mathrm {softmax}\bigl(\mathrm{max}_{\mathrm{col}}(\mathbf S)\bigr),\quad \mathrm{max}_{\mathrm{col}}(\mathbf S)_t = \max_j \mathbf S_{tj}\]
  • attended context vector $\tilde {\mathbf h} = \sum_t \mathbf b_t \mathbf H_{:t}$

  • $\tilde {\mathbf h}$ is tiled $T$ times to obtain:

    \[\tilde {\mathbf H}\in \mathbb R^{2d\times T}\]
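
A matching NumPy sketch for query-to-context attention (again with random stand-ins):

```python
import numpy as np

d, T, J = 100, 10, 5
S = np.random.randn(T, J)                    # similarity matrix (stand-in)
H = np.random.randn(2*d, T)                  # contextual context embeddings (stand-in)

m = S.max(axis=1)                            # max_j S_{tj} for each context position t
b = np.exp(m - m.max()); b /= b.sum()        # b = softmax over the row-wise maxima, shape (T,)

h_tilde = H @ b                              # (2d,): sum_t b_t H_{:t}
H_tilde = np.tile(h_tilde[:, None], (1, T))  # (2d, T): h_tilde copied T times
```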

Query-aware representation of each context word $\mathbf G_{:t}$

\[\mathbf G_{:t} = \beta(\mathbf H_{:t}, \tilde {\mathbf U}_{:t}, \tilde {\mathbf H}_{:t})\in \mathbb R^{d_G}\] \[\beta(\mathbf{h},\tilde {\mathbf u}, \tilde {\mathbf h}) = [\mathbf{h};\tilde {\mathbf u};\mathbf h\circ \tilde {\mathbf u};\mathbf h \circ \tilde {\mathbf h}]\in \mathbb R^{8d}\]

Here $d_G = 8d$, so $\mathbf G \in \mathbb R^{8d\times T}$.
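
The concatenation itself is a one-liner; a NumPy sketch with random stand-ins for the three matrices:

```python
import numpy as np

d, T = 100, 10
H = np.random.randn(2*d, T)                # contextual context embeddings (stand-in)
U_tilde = np.random.randn(2*d, T)          # context-to-query attended vectors (stand-in)
H_tilde = np.random.randn(2*d, T)          # tiled query-to-context vector (stand-in)

# G_{:t} = [H_{:t}; U~_{:t}; H_{:t}∘U~_{:t}; H_{:t}∘H~_{:t}]
G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)   # (8d, T)
```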

Modeling Layer

Feeding $\mathbf G$ into a bi-LSTM gives $\mathbf M \in \mathbb R^{2d\times T}$. Each column of $\mathbf M$ can be seen as a representation of a context word that takes both its surrounding context and the query into account.
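
A PyTorch sketch of the modeling layer (the two-layer choice and all shapes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

d, T = 100, 10
G = torch.randn(1, T, 8*d)                 # query-aware representations, (batch, T, 8d)

model_lstm = nn.LSTM(input_size=8*d, hidden_size=d,
                     num_layers=2, bidirectional=True, batch_first=True)
M, _ = model_lstm(G)                       # (1, T, 2d)
M = M.squeeze(0).T                         # (2d, T), matching the notation above
```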

Output Layer

Start

\[\mathbf p^1 = \mathrm {softmax}(\mathbf {w_{p^1}^T[G;M]}), \quad \mathbf {w_{p^1}}\in \mathbb R^{10d}\]

End

Feeding $\mathbf M$ into another bi-LSTM gives $\mathbf M^2 \in \mathbb R^{2d\times T}$, and

\[\mathbf p^2 = \mathrm {softmax}(\mathbf {w_{p^2}^T[G;M^2]}), \quad \mathbf {w_{p^2}}\in \mathbb R^{10d}\]
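
A NumPy sketch of both pointer distributions; $\mathbf M^2$, the weights, and all shapes are random stand-ins rather than trained values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, T = 100, 10
G  = np.random.randn(8*d, T)               # attention flow output (stand-in)
M  = np.random.randn(2*d, T)               # modeling layer output (stand-in)
M2 = np.random.randn(2*d, T)               # output of the extra bi-LSTM over M (stand-in)
w1 = np.random.randn(10*d)                 # trainable start-pointer weight (stand-in)
w2 = np.random.randn(10*d)                 # trainable end-pointer weight (stand-in)

p1 = softmax(w1 @ np.concatenate([G, M ], axis=0))   # (T,) start-position distribution
p2 = softmax(w2 @ np.concatenate([G, M2], axis=0))   # (T,) end-position distribution
```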

Loss

Training:

\[L(\theta) = -\frac 1 N \sum_{i=1}^{N}\left[\log(\mathbf p_{y_i^1}^1) + \log(\mathbf p_{y_i^2}^2)\right]\]

$\mathbf p_k$ denotes the predicted probability of the $k$-th position, and $y_i^1$, $y_i^2$ are the true start and end indices of the $i$-th example.
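
A tiny NumPy check of the loss on made-up data (`y1`, `y2` are hypothetical gold start/end indices):

```python
import numpy as np

N, T = 4, 10
rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(T), size=N)     # predicted start distributions, one row per example
p2 = rng.dirichlet(np.ones(T), size=N)     # predicted end distributions
y1 = rng.integers(0, T, size=N)            # gold start positions
y2 = rng.integers(0, T, size=N)            # gold end positions

loss = -np.mean(np.log(p1[np.arange(N), y1]) + np.log(p2[np.arange(N), y2]))
print(loss)
```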

Test:

The answer span $(k, l)$ with $k\leq l$ that maximizes the product $\mathbf p_k^1\mathbf p_l^2$ is chosen.

This maximization can be done in linear time with dynamic programming.
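
A sketch of that linear-time search: sweep the end position $l$ from left to right while tracking the best start probability seen so far (function name and toy data are made up):

```python
import numpy as np

def best_span(p1, p2):
    """Return (k, l) with k <= l maximizing p1[k] * p2[l], in O(T) time."""
    best_score, span, argmax_p1 = -1.0, (0, 0), 0
    for l in range(len(p2)):
        if p1[l] > p1[argmax_p1]:          # best start position among 0..l
            argmax_p1 = l
        score = p1[argmax_p1] * p2[l]
        if score > best_score:
            best_score, span = score, (argmax_p1, l)
    return span

p1 = np.array([0.1, 0.5, 0.2, 0.2])        # toy start distribution
p2 = np.array([0.2, 0.1, 0.6, 0.1])        # toy end distribution
print(best_span(p1, p2))                   # -> (1, 2)
```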