Model Architecture

In the overall Transformer architecture, both the encoder and the decoder are built from stacked self-attention and point-wise, fully connected layers.

Encoder

The encoder is composed of a stack of $N=6$ identical layers (EncoderLayer).

Decoder

The decoder is likewise composed of a stack of $N=6$ identical layers (DecoderLayer).

Attention

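An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.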
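The description above can be sketched for a single query in plain Python. This is a minimal illustration, assuming the compatibility function is the scaled dot product $\frac{q \cdot k}{\sqrt{d_k}}$ followed by a softmax; the function names `softmax` and `attention` are chosen here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Map one query and a set of key-value pairs to an output vector.

    Assumes scaled dot-product compatibility: score = (q . k) / sqrt(d_k).
    Returns the output (weighted sum of values) and the weights.
    """
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(d_v)
    ]
    return output, weights


# Two identical keys receive equal weight, so the output is the mean of the values.
output, weights = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```

A real implementation would process all queries at once as matrix products; the loop form is only meant to make the weighted sum explicit.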

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

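$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$$

where the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Because the dimension of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.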
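The dimension bookkeeping can be checked with a little arithmetic. The sketch below assumes $d_{\text{model}} = h \cdot d_k = 512$, which follows from the values given, and counts only the projection weights (no biases). The helper `projection_params` is a hypothetical name used for illustration.

```python
h = 8
d_k = d_v = 64
d_model = h * d_k  # 512, implied by d_k = d_model / h = 64


def projection_params(num_heads, d_model, d_k, d_v):
    """Count weights in W^Q_i, W^K_i, W^V_i for every head, plus W^O."""
    per_head = d_model * d_k + d_model * d_k + d_model * d_v
    output_projection = num_heads * d_v * d_model
    return num_heads * per_head + output_projection


# Eight heads of width 64 versus one head of full width 512.
multi_head_params = projection_params(h, d_model, d_k, d_v)
single_head_params = projection_params(1, d_model, d_model, d_model)
```

Both counts come out to $4 d_{\text{model}}^2$, which is one way to see why splitting into heads does not increase the cost of the projections.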

Applications of Attention in our Model

1. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.
2. Self-attention layers in the encoder. In these layers, all of the keys, values and queries come from the output of the previous layer. Each position in the encoder can attend to all positions in the previous layer of the encoder.
3. Self-attention layers in the decoder. These allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections.
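The masking in item 3 can be sketched in plain Python. This is an illustration rather than the original implementation: `subsequent_mask` and `masked_softmax` are names chosen here, and the mask is a list of booleans where `True` marks a legal connection.

```python
import math


def subsequent_mask(size):
    """Row i allows positions 0..i (up to and including i) and blocks the rest."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, allowed):
    """Softmax where illegal positions are set to -inf before normalizing.

    Assumes at least one position is allowed, which always holds for a
    causal mask because each position may attend to itself.
    """
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = subsequent_mask(3)
# Position 1 may attend to positions 0 and 1, but not to position 2.
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
```

Because the blocked score becomes $-\infty$, its softmax weight is exactly zero and the remaining weights still sum to one.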

Positional Encoding

The positional encoding adds a sine wave based on position. The frequency and offset of the wave are different for each dimension.
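A minimal sketch of such an encoding follows. It assumes the sinusoidal formulas from the original Transformer paper, which are not stated in these notes: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$. The function name `positional_encoding` is chosen for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formula.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)      # even dimensions use sine
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


pe = positional_encoding(max_len=4, d_model=8)
```

Each pair of dimensions shares one frequency, and the cosine acts as the phase-shifted partner of the sine, which is what gives every dimension its own frequency and offset.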

Training

Training Data and Batching

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs with approximately 25000 source tokens and 25000 target tokens.
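One way to picture this kind of batching is the sketch below. It is an assumption-laden illustration, not the procedure used in the original work: the helper `batch_by_tokens` is hypothetical, it sorts pairs by length so that similar lengths end up together, and it closes a batch as soon as either the source or the target token count would exceed the budget.

```python
def batch_by_tokens(pairs, max_tokens):
    """Group (source, target) token lists into batches under a token budget."""
    ordered = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, current = [], []
    src_count = tgt_count = 0
    for src, tgt in ordered:
        over_budget = (
            src_count + len(src) > max_tokens
            or tgt_count + len(tgt) > max_tokens
        )
        if current and over_budget:
            batches.append(current)
            current, src_count, tgt_count = [], 0, 0
        current.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if current:
        batches.append(current)
    return batches


pairs = [
    (["a"] * 3, ["b"] * 3),
    (["a"] * 1, ["b"] * 2),
    (["a"] * 2, ["b"] * 2),
]
# A tiny budget of 4 tokens stands in for the 25000 used in training.
batches = batch_by_tokens(pairs, max_tokens=4)
```

Grouping by length keeps the amount of padding per batch small, which is the practical reason for batching by approximate sequence length.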

Regularization (Label Smoothing)

Here we can see an example of how the mass is distributed to the words based on confidence.

Label smoothing actually starts to penalize the model if it gets very confident about a given choice.
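The penalty on over-confidence can be illustrated with a small sketch. It assumes a simple variant in which the smoothing mass is spread uniformly over all non-target classes; the names `smooth_targets` and `cross_entropy` and the value `smoothing=0.1` are illustrative choices, not taken from these notes.

```python
import math


def smooth_targets(target, vocab_size, smoothing):
    """Put 1 - smoothing on the target and share the rest among other classes."""
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target] = 1.0 - smoothing
    return dist


def cross_entropy(true_dist, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted) if t > 0)


true_dist = smooth_targets(target=2, vocab_size=5, smoothing=0.1)

# A prediction that matches the smoothed target versus a very confident one.
calibrated = list(true_dist)
overconfident = [0.00025, 0.00025, 0.999, 0.00025, 0.00025]

loss_calibrated = cross_entropy(true_dist, calibrated)
loss_overconfident = cross_entropy(true_dist, overconfident)
```

Under these assumptions the very confident prediction receives the larger loss, because it assigns almost no probability to the classes that the smoothed target still expects.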

A Real World Example

We will load the dataset using torchtext and spacy for tokenization.

Question

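• Source sentences differ in length and so cannot be stacked into a batch directly, which is why padding is introduced. The source mask is passed to the encoder so that, when attention computes $\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})$, the values at pad positions have no effect.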
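The padding answer above can be sketched as follows. The pad id `0` and the helper names `pad_batch`, `source_mask`, and `masked_weights` are assumptions made for illustration; the scores stand in for one row of $\frac{QK^T}{\sqrt{d_k}}$.

```python
import math

PAD_ID = 0  # hypothetical padding token id


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists so they all share the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def source_mask(batch, pad_id=PAD_ID):
    """True where a real token sits, False at pad positions."""
    return [[tok != pad_id for tok in seq] for seq in batch]


def masked_weights(scores, mask_row):
    """Softmax over scores with pad positions set to -inf beforehand.

    Assumes every sequence contains at least one real token.
    """
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


batch = pad_batch([[5, 6, 7], [8]])
mask = source_mask(batch)
# The second sentence has one real token, so it receives all of the weight.
weights = masked_weights([0.3, 0.9, 0.1], mask[1])
```

The pad positions end up with zero attention weight, so whatever values sit there cannot influence the output.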