Therefore, in tasks such as machine translation, when a vocabulary is built from the training corpus, words that occur with very low frequency are usually filtered out and uniformly mapped to UNK (Unknown). By Zipf's law this removes a large number of rare words, which keeps the model small and also helps somewhat against overfitting. In addition, once the model is deployed for inference it is very likely to encounter words that never appeared in the training corpus; these are mapped to UNK as well. Words that are not in the vocabulary and end up marked as UNK are commonly called out-of-vocabulary (OOV) words, or unseen words.
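As a minimal sketch of this vocabulary-truncation step (all function and variable names here are illustrative, not from any particular toolkit):

from collections import Counter

def build_vocab(corpus_tokens, min_freq=5):
    # Count word frequencies and keep only words seen at least min_freq times;
    # everything else will later be mapped to the special UNK token.
    counts = Counter(corpus_tokens)
    vocab = {'<unk>': 0}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def tokens_to_ids(tokens, vocab):
    # Out-of-vocabulary words fall back to the UNK id.
    return [vocab.get(t, vocab['<unk>']) for t in tokens]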
import collections
import re


def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair in the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    # Merge the given symbol pair into a single symbol in every word.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out


vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 1000
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
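To actually use these merges for segmentation, the learned operations have to be recorded and replayed on new words. The following is a rough sketch of that encoding step (it reuses get_stats and merge_vocab from above; subword-nmt's real apply_bpe is more elaborate):

def learn_merges(vocab, num_merges=10):
    # Same loop as above, but record the merge operations in order.
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    # Start from single characters plus the end-of-word marker,
    # then replay the merges in the order they were learned.
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

toy_vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_merges(toy_vocab, num_merges=10)
print(segment('lowest', merges))   # typically ['low', 'est</w>'] with this toy vocabulary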
optional arguments:
  -h, --help            show this help message and exit
  --input PATH [PATH ...], -i PATH [PATH ...]
                        Input texts (multiple allowed).
  --output PATH, -o PATH
                        Output file for BPE codes.
  --symbols SYMBOLS, -s SYMBOLS
                        Create this many new symbols (each representing a character n-gram) (default: 10000))
  --separator STR       Separator between non-final subword units (default: '@@'))
  --write-vocabulary PATH [PATH ...]
                        Write to these vocabulary files after applying BPE. One per input text. Used for filtering in apply_bpe.py
  --min-frequency FREQ  Stop if no symbol pair has frequency >= FREQ (default: 2))
  --total-symbols, -t   subtract number of characters from the symbols to be generated (so that '--symbols' becomes an estimate for the total number of symbols needed to encode text).
  --verbose, -v         verbose mode.
Let's take a look at the generated code.file and voc.txt.
code.file:
#version: 0.2
t h
i n
th e</w>
a n
r e
t i
e n
o n
an d</w>
e r
···
beijing , 1 mar ( xinhua ) -- tian feng@@ shan , former heilongjiang governor who is 5@@ 9 years old , was appointed minister of land and resources today .
tian feng@@ shan , who was born in zhao@@ yuan county , heilongjiang province , took part in work since july 196@@ 1 and joined the cpc in march 1970 .
this should be a natural process set off by economic development ; the " third tier construction " of the 1960s involving fac@@ tory relocation was something entirely different .
we must also realize however that from the angle of changing the pattern of resource allocation , we have not yet made the big breakthrough in reform .
with regard to joining the world trade organization , one recent reaction has been blind optim@@ ism and the belief that china will profit whatever it does .
since these areas where objective conditions are not particularly good can achieve this , other areas where conditions are better can naturally do the same .
the objective trend of globalization is calling for international cooperation on a global scale , and a global cooperation has far exceeded the scope of the economy .
Decoding
So how do we restore the file to what it looked like before BPE encoding? Just run the following command:
sed -r 's/(@@ )|(@@ ?$)//g' result.txt

# save the decoded result to a file
sed -r 's/(@@ )|(@@ ?$)//g' result.txt > restore.txt
The restored result is:
beijing , 1 mar ( xinhua ) -- tian fengshan , former heilongjiang governor who is 59 years old , was appointed minister of land and resources today .
tian fengshan , who was born in zhaoyuan county , heilongjiang province , took part in work since july 1961 and joined the cpc in march 1970 .
this should be a natural process set off by economic development ; the " third tier construction " of the 1960s involving factory relocation was something entirely different .
we must also realize however that from the angle of changing the pattern of resource allocation , we have not yet made the big breakthrough in reform .
with regard to joining the world trade organization , one recent reaction has been blind optimism and the belief that china will profit whatever it does .
since these areas where objective conditions are not particularly good can achieve this , other areas where conditions are better can naturally do the same .
the objective trend of globalization is calling for international cooperation on a global scale , and a global cooperation has far exceeded the scope of the economy .
from subword_nmt import apply_bpe, learn_bpe

# Learn the BPE merge operations and write them to ../data/toy_bpe.txt
with open('../data/toy_vocab.txt', 'r', encoding='utf-8') as in_file, \
        open('../data/toy_bpe.txt', 'w+', encoding='utf-8') as out_file:
    # 1500 is the final BPE vocabulary size; is_dict indicates that the input file
    # is a vocabulary file whose lines have the format "<word> <count>"
    learn_bpe.learn_bpe(in_file, out_file, 1500, verbose=True, is_dict=True)
# Read the BPE codes from ../data/toy_bpe.txt, apply them to the text in
# ../data/bpe_test_raw.txt, and write the result to ../data/bpe_test_processed.txt
with open('../data/bpe_test_raw.txt', 'r', encoding='utf-8') as in_file, \
        open('../data/bpe_test_processed.txt', 'w+', encoding='utf-8') as out_file, \
        open('../data/toy_bpe.txt', 'r', encoding='utf-8') as code_file:
    # build the BPE segmenter from the learned codes
    bpe = apply_bpe.BPE(code_file)
    for line in in_file:
        # segment each line with BPE
        out_file.write(bpe.process_line(line))
This function first calls FairseqTask.build_generator of <fairseq.tasks.translation.TranslationTask>, passing in gen_args (which contains the beam size). Inside build_generator, search_strategy = search.BeamSearch(self.target_dictionary) instantiates BeamSearch (using model/wmt16.en-de.joined-dict.transformer/dict.de.txt), and this search strategy is bundled together with the model into a SequenceGenerator. When inference actually runs, SequenceGenerator.generate of SequenceGenerator is called, which carries out model inference and beam search in one pass.
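For orientation, the same generation path can be driven end to end through fairseq's torch.hub interface. This is a hedged sketch: it assumes the 'transformer.wmt16.en-de' hub entry, which corresponds to the wmt16.en-de.joined-dict.transformer checkpoint used here.

import torch

# Load the pretrained WMT16 en-de transformer together with the Moses tokenizer
# and the subword-nmt BPE model discussed in this post.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')

# translate() runs encode -> generate (beam search) -> decode under the hood
print(en2de.translate('Hello world!', beam=5))   # expected: 'Hallo Welt!'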
We will skip the internal details for now and just look at the input and output. The input is:
After inference, the output is as follows (the token IDs of the 5 hypotheses below are actually different, even though that is not visible here):
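To see that the five hypotheses really do differ, you can inspect them directly; a small sketch (batched_hypos is the variable used in the post-processing step below, and en2de is the hub model from the earlier sketch):

# batched_hypos[i] is a list of beam-many hypothesis dicts for the i-th input sentence
for hypo in batched_hypos[0]:
    # every hypothesis carries its own token IDs and model score
    print(hypo["score"].item(), hypo["tokens"].tolist())
    print(en2de.decode(hypo["tokens"]))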
Post-processing
Finally, let's look at the post-processing stage. The corresponding code is [self.decode(hypos[0]["tokens"]) for hypos in batched_hypos], i.e. tensor([12006, 165, 488, 88, 2], device='cuda:0') is fed into self.decode. The main steps of this function are: string -> remove BPE -> detokenize.
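Put together, the three steps roughly correspond to the following sketch (a simplified paraphrase of this decode flow, not the verbatim fairseq code; tgt_dict, bpe and tokenizer stand for the target dictionary, the SubwordNMTBPE object and the Moses tokenizer, and the function name is made up):

def postprocess(tokens, tgt_dict, bpe, tokenizer):
    sentence = tgt_dict.string(tokens)           # IDs -> subword string: 'Hall@@ o Welt !'
    sentence = bpe.decode(sentence)              # remove BPE markers:    'Hallo Welt !'
    return tokenizer.decode(sentence)            # Moses detokenization:  'Hallo Welt!'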
Let's first look at self.string. It is the inverse of self.binarize: it calls Dictionary.string of <fairseq.data.dictionary.Dictionary> and looks up the tgt_dict built earlier to map IDs back to strings (roughly, an ID is the word's line number in model/wmt16.en-de.joined-dict.transformer/dict.de.txt offset by the special symbols that occupy the first few IDs). Its input is tensor([12006, 165, 488, 88, 2], device='cuda:0') and its output is 'Hall@@ o Welt !'.
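A quick way to check this mapping yourself (a hedged sketch; it assumes the same checkpoint directory is available locally):

import torch
from fairseq.data import Dictionary

tgt_dict = Dictionary.load('model/wmt16.en-de.joined-dict.transformer/dict.de.txt')
# maps the IDs back to the subword string, e.g. 'Hall@@ o Welt !'
print(tgt_dict.string(torch.tensor([12006, 165, 488, 88, 2])))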
Next comes self.remove_bpe, which does the opposite of self.apply_bpe. It calls SubwordNMTBPE.decode of <fairseq.data.encoders.subword_nmt_bpe.SubwordNMTBPE>. We will not look at its implementation yet and only note the input and output: the input is 'Hall@@ o Welt !' and the output is 'Hallo Welt !'.
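The effect is the same as the sed command shown in the decoding section earlier; in Python it boils down to one regular expression (a sketch, not the fairseq implementation):

import re

# strip the '@@ ' continuation markers and any trailing '@@'
print(re.sub(r'(@@ )|(@@ ?$)', '', 'Hall@@ o Welt !'))   # -> 'Hallo Welt !'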
# This will make the dataset compatible to the one used in "Convolutional Sequence to Sequence Learning"
# https://arxiv.org/abs/1705.03122
# If --icml17 is given, corpus 2 is replaced with the WMT14 news-commentary data instead of the WMT17 one,
# to stay consistent with the ConvS2S paper
if [ "$1" == "--icml17" ]; then
    URLS[2]="http://statmt.org/wmt14/training-parallel-nc-v9.tgz"
    FILES[2]="training-parallel-nc-v9.tgz"
    CORPORA[2]="training/news-commentary-v9.de-en"
    OUTDIR=wmt14_en_de   # name of the output directory
else
    OUTDIR=wmt17_en_de
fi
echo"pre-processing train data..."# 预处理训练语料 for l in$src$tgt; do rm $tmp/train.tags.$lang.tok.$l# 如果存在,先移除 for f in"${CORPORA[@]}"; do cat $orig/$f.$l | \ perl $NORM_PUNC$l | \ # 先标准化符号 perl $REM_NON_PRINT_CHAR | \ # 移除非打印字符 perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l# 分词 done done
echo"pre-processing test data..."# 预处理测试语料 for l in$src$tgt; do if [ "$l" == "$src" ]; then t="src" else t="ref" fi grep '<seg id'$orig/test-full/newstest2014-deen-$t.$l.sgm | \ #这一块操作没看懂 sed -e 's/<seg id="[0-9]*">\s*//g' | \ sed -e 's/\s*<\/seg>\s*//g' | \ sed -e "s/\’/\'/g" | \ perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l# 分词 echo"" done
After preprocessing, one of the sentences in test.en reads: They are not even 100 metres apart : On Tuesday , the new B 33 pedestrian lights in Dorfparkplatz in Gutach became operational - within view of the existing Town Hall traffic lights . As you can see, the punctuation has been separated from the words.
9. Splitting the training and validation sets
echo"splitting train and valid..."# 划分训练集和验证集 for l in$src$tgt; do awk '{if (NR%100 == 0) print $0; }'$tmp/train.tags.$lang.tok.$l > $tmp/valid.$l# 从训练集中,每100个句子抽1个句子作为验证集 awk '{if (NR%100 != 0) print $0; }'$tmp/train.tags.$lang.tok.$l > $tmp/train.$l done
for L in $src $tgt; do
    # apply the learned BPE codes to all three corpora (train, valid, test)
    for f in train.$L valid.$L test.$L; do
        echo "apply_bpe.py to ${f}..."
        # write the output to the corresponding file in tmp, prefixed with "bpe."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
    done
done
See here: when you specify --share-all-embeddings, the embedding matrices for encoder input, decoder input and decoder output are all shared. When you specify --share-decoder-input-output-embed, the matrices for decoder input and output are shared, but the encoder has its own embeddings.
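The difference is easy to see in plain PyTorch (a simplified sketch of the weight-tying idea, not fairseq's actual model code):

import torch.nn as nn

vocab_size, embed_dim = 32768, 512

# --share-all-embeddings: one matrix for encoder input, decoder input and decoder output
# (this requires a joined source/target dictionary, as in the wmt16.en-de.joined-dict model)
shared = nn.Embedding(vocab_size, embed_dim)
encoder_embed = shared
decoder_embed = shared
decoder_output = nn.Linear(embed_dim, vocab_size, bias=False)
decoder_output.weight = shared.weight

# --share-decoder-input-output-embed: only decoder input and output are tied,
# and the encoder keeps its own embedding matrix (source/target dictionaries may differ)
encoder_embed = nn.Embedding(vocab_size, embed_dim)
decoder_embed = nn.Embedding(vocab_size, embed_dim)
decoder_output = nn.Linear(embed_dim, vocab_size, bias=False)
decoder_output.weight = decoder_embed.weight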