A Summary of PyTorch Loss Functions

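At the start of this semester I dug into TensorFlow's session and graph concepts. That gave me a deeper understanding of TensorFlow than before, but writing code with it still did not always feel natural. Fortunately I later came across PyTorch, whose design feels elegant and easy to follow, and which lets me write code the way I want. On top of that, I recently spent some time properly learning what cross entropy is. Below is a detailed summary of PyTorch's loss functions. Most of the content comes from "pytorch loss function 总结" (see the references), with some of my own comments added.

I recently read through the PyTorch loss function documentation, wrote down my own understanding, and reformatted the formulas below for later reference.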

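Note that many loss functions take two boolean parameters, size_average and reduce, which need some explanation. A loss function usually operates on a whole batch at once, so the raw loss has one entry per sample, for example a vector of shape (batch_size, ).

If reduce = False, size_average is ignored and the loss is returned in vector form;
If reduce = True, the loss is returned as a scalar:
- if size_average = True, it returns loss.mean();
- if size_average = False, it returns loss.sum();

So in the examples below both parameters are generally set to False, which makes the original definition of each loss easier to see. (Newer PyTorch versions deprecate both parameters in favor of a single reduction argument that takes 'none', 'mean' or 'sum'.)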
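The reduction rules above can be sketched in plain Python. This only illustrates the semantics described here; `reduce_loss` is a hypothetical helper name and not part of PyTorch.

```python
def reduce_loss(per_sample, reduce=True, size_average=True):
    """Hypothetical helper mimicking the reduce / size_average semantics."""
    if not reduce:
        # size_average is ignored; return the unreduced losses
        return list(per_sample)
    total = sum(per_sample)
    # mean if size_average is True, plain sum otherwise
    return total / len(per_sample) if size_average else total


losses = [1.0, 2.0, 3.0]
print(reduce_loss(losses, reduce=False))                     # unreduced list
print(reduce_loss(losses, reduce=True, size_average=True))   # mean of the losses
print(reduce_loss(losses, reduce=True, size_average=False))  # sum of the losses
```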

The common loss functions are listed below.

nn.L1Loss

$\text{loss}(\mathbf{x}_i, \mathbf{y}_i) = |\mathbf{x}_i - \mathbf{y}_i|$

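This is still not stated very clearly: $\mathbf{x}$ and $\mathbf{y}$ must have the same shape (vectors or matrices), and the resulting loss has that same shape. The subscript $i$ denotes the $i$-th element.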

import torch
loss_fn = torch.nn.L1Loss(reduce=False, size_average=False)
input = torch.autograd.Variable(torch.randn(3,4))
target = torch.autograd.Variable(torch.randn(3,4))
loss = loss_fn(input, target)
print(input); print(target); print(loss)
print(input.size(), target.size(), loss.size())

tensor([[ 1.3165,  0.3775,  0.7050,  1.5433],
        [ 0.6242, -0.6022,  1.3705, -2.1026],
        [ 1.1888,  0.7403, -0.1543, -1.0201]])
tensor([[ 0.0808, -0.4403,  0.6464, -0.0929],
        [-1.1898, -1.1335, -1.0669,  0.3216],
        [-0.6352, -1.4463,  1.3619,  0.9186]])
tensor([[1.2357, 0.8178, 0.0586, 1.6362],
        [1.8140, 0.5313, 2.4375, 2.4241],
        [1.8240, 2.1866, 1.5162, 1.9387]])
torch.Size([3, 4]) torch.Size([3, 4]) torch.Size([3, 4])

nn.SmoothL1Loss

Also known as the Huber loss: it is a squared loss when the error lies in (-1, 1) and an L1 loss otherwise.

$\text{loss}(\mathbf{x}_i, \mathbf{y}_i) = \left\{\begin{matrix} \frac12(\mathbf{x}_i -\mathbf{y}_i)^2 & \text{if } |\mathbf{x}_i -\mathbf{y}_i| < 1 \\ |\mathbf{x}_i -\mathbf{y}_i| - \frac12, & \text{otherwise} \end{matrix}\right.$

As with L1Loss above, this is an element-wise operation; the subscript $i$ denotes the $i$-th element of $\mathbf{x}$.

loss_fn = torch.nn.SmoothL1Loss(reduce=False, size_average=False)
input = torch.autograd.Variable(torch.randn(3,4))
target = torch.autograd.Variable(torch.randn(3,4))
loss = loss_fn(input, target)
print(input); print(target); print(loss)
print(input.size(), target.size(), loss.size())

tensor([[-0.0594, -0.4357,  0.8279,  1.3302],
        [-1.4642,  0.1844,  0.2676, -0.5577],
        [-0.9834,  0.4354, -0.5264, -0.8768]])
tensor([[ 1.5666, -1.1353, -0.4636, -0.4286],
        [ 0.5522, -0.4245,  0.3089, -0.0351],
        [-2.2251, -2.1202,  1.0612, -0.8749]])
tensor([[1.1260e+00, 2.4475e-01, 7.9155e-01, 1.2588e+00],
        [1.5164e+00, 1.8537e-01, 8.5103e-04, 1.3658e-01],
        [7.4173e-01, 2.0556e+00, 1.0877e+00, 1.7281e-06]])
torch.Size([3, 4]) torch.Size([3, 4]) torch.Size([3, 4])
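As a sanity check on the piecewise formula, two entries of the output above can be recomputed by hand. This is a plain-Python sketch of the formula only; `smooth_l1` is a hypothetical helper name, and because the inputs are the 4-decimal printed values, the results match the printed loss only approximately.

```python
def smooth_l1(x, y):
    """Element-wise Smooth L1 loss as defined by the formula above."""
    d = abs(x - y)
    # quadratic inside (-1, 1), linear (shifted by 1/2) outside
    return 0.5 * d * d if d < 1 else d - 0.5


# entry [0][0]: |error| > 1, so the linear branch applies (about 1.1260)
print(smooth_l1(-0.0594, 1.5666))
# entry [1][2]: |error| < 1, so the quadratic branch applies (about 8.5e-04)
print(smooth_l1(0.2676, 0.3089))
```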

nn.MSELoss

Mean squared error loss. Usage is the same as above: loss, x and y have the same shape (vectors or matrices), and $i$ is the element index.

$\text{loss}(\mathbf{x}_i, \mathbf{y}_i) = (\mathbf{x}_i - \mathbf{y}_i)^2$

loss_fn = torch.nn.MSELoss(reduce=False, size_average=False)
input = torch.autograd.Variable(torch.randn(3,4))
target = torch.autograd.Variable(torch.randn(3,4))
loss = loss_fn(input, target)
print(input); print(target); print(loss)
print(input.size(), target.size(), loss.size())

tensor([[ 1.5255, -0.0708,  0.8906, -0.0271],
        [-0.8525, -1.3048,  0.6440,  0.5908],
        [ 0.8502, -1.2114,  0.8382, -0.3822]])
tensor([[ 1.0601, -1.0832, -0.4962, -0.4466],
        [ 1.4013, -0.1358,  2.3916,  1.1730],
        [-0.4183,  1.7537,  0.1232,  1.5505]])
tensor([[0.2166, 1.0250, 1.9234, 0.1759],
        [5.0798, 1.3666, 3.0541, 0.3389],
        [1.6092, 8.7917, 0.5113, 3.7355]])
torch.Size([3, 4]) torch.Size([3, 4]) torch.Size([3, 4])

nn.BCELoss

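Cross entropy for binary classification. A Sigmoid must be applied in front of this layer. For the definition of cross entropy, see the Wikipedia page: Cross Entropy

The discrete cross entropy is defined as $H(\boldsymbol{p}, \boldsymbol{q}) = -\sum_i \boldsymbol{p}_i \log \boldsymbol{q}_i$, where $\boldsymbol{p}, \boldsymbol{q}$ are vectors and both are probability distributions. In binary classification there are only positive and negative examples, whose probabilities sum to 1, so predicting a single probability is enough (that is, neither the true label nor the actual output needs to be one-hot encoded). The formula therefore simplifies to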

$\text{loss}(\mathbf{x}_i, \mathbf{y}_i) = - \boldsymbol{w}_i \left[\mathbf{y}_i \log \mathbf{x}_i + (1-\mathbf{y}_i)\log(1-\mathbf{x}_i) \right ]$

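Note that $\mathbf{x}, \mathbf{y}$ can be vectors or matrices, and $i$ is just an index; $\mathbf{x}_i$ is the predicted probability that the $i$-th sample is positive, $\mathbf{y}_i$ is the label of the $i$-th sample, and $\boldsymbol{w}_i$ is the weight of that term. As can be seen, loss, x, y and w all have the same shape.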

import torch
from torch.autograd import Variable
loss_fn = torch.nn.BCELoss(reduce=False, size_average=False)
input = Variable(torch.randn(3, 4))
target = Variable(torch.FloatTensor(3, 4).random_(2))
# apply the sigmoid first; torch.sigmoid replaces the deprecated F.sigmoid
loss = loss_fn(torch.sigmoid(input), target)
print(input); print(target); print(loss)

tensor([[ 0.0088, -1.3303,  1.2703,  0.4140],
        [ 0.5810,  0.6128, -0.9939,  1.0171],
        [-2.1501,  0.8889, -2.5859,  0.8738]])
tensor([[1., 0., 0., 1.],
        [0., 1., 0., 1.],
        [1., 1., 0., 1.]])
tensor([[0.6888, 0.2346, 1.5177, 0.5074],
        [1.0253, 0.4330, 0.3149, 0.3087],
        [2.2603, 0.3444, 0.0726, 0.3488]])
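The simplified formula can be checked against the printed output with a few lines of plain Python. This is a sketch of the formula only, not PyTorch's implementation; `sigmoid` and `bce` are hypothetical helper names, and the inputs are the rounded values printed above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y, w=1.0):
    """Element-wise binary cross entropy; p is a probability, y is 0 or 1."""
    return -w * (y * math.log(p) + (1 - y) * math.log(1 - p))


# entry [0][0]: input 0.0088, target 1 (printed loss 0.6888)
print(bce(sigmoid(0.0088), 1.0))
# entry [0][2]: input 1.2703, target 0 (printed loss 1.5177)
print(bce(sigmoid(1.2703), 0.0))
```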

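What is a bit odd here is that the weight does not have size 2 (one entry per class) but the same shape as x and y. When positive and negative samples are imbalanced, an extra line may be needed: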

class_weight = Variable(torch.FloatTensor([1, 10])) # positives are rare here, so they get a larger weight
target = Variable(torch.FloatTensor(3, 4).random_(2))
weight = class_weight[target.long()] # (3, 4)
loss_fn = torch.nn.BCELoss(weight=weight, reduce=False, size_average=False)
# ... (rest omitted)

class_weight
Out[16]: tensor([ 1., 10.])

weight
Out[15]: 
tensor([[10., 10.,  1., 10.],
        [ 1., 10.,  1.,  1.],
        [ 1., 10., 10.,  1.]])

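With this approach, if the batch size differs from batch to batch, loss_fn has to be redefined every time. I am not sure whether there is a better solution; one option that may help is the functional form F.binary_cross_entropy, which takes weight as a per-call argument.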

nn.BCEWithLogitsLoss

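nn.BCELoss above needs a Sigmoid layer added by hand. This loss combines the two, which lets it use the log_sum_exp trick and makes the result numerically more stable. This is the recommended loss function.

Note that the documentation lists only weight and size_average as parameters, but in my tests the reduce parameter also works. In addition, both loss functions require the target to be a FloatTensor, and its values need not be exactly 0 or 1; other values (for example soft labels in [0, 1]) should also be accepted.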
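To see why fusing the sigmoid into the loss helps, here is a plain-Python sketch comparing a naive sigmoid-then-log computation with one common stable rewrite, max(x, 0) - x*y + log(1 + exp(-|x|)). The rewrite is an assumption about how such a fused loss can be computed, not a claim about PyTorch's exact implementation; both function names are hypothetical.

```python
import math


def bce_naive(x, y):
    """Sigmoid followed by the BCE formula; breaks down for large |x|."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(x, y):
    """Algebraically equivalent form that never takes log(0)."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# the two agree for moderate logits
print(bce_naive(0.5, 1.0), bce_with_logits(0.5, 1.0))

# for a large logit the sigmoid rounds to exactly 1.0 and log(1 - p) fails
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(bce_with_logits(40.0, 0.0))
```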

nn.CrossEntropyLoss

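Cross entropy loss for multi-class classification. No Softmax layer is needed in front of this loss.

The loss here should in principle follow the original cross entropy formula, but the target type is restricted to torch.LongTensor and the task is not multi-label, which means the label is effectively one-hot (exactly one position is 1 and all others are 0). Substituting this into the cross entropy formula and simplifying gives the form below. See the derivation of the Softmax loss in the cs231n assignments.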

$\begin{align*} \text{loss}(\mathbf{x}, \text{label}) &= - \boldsymbol{w}_{\text{label}}\log \frac{e^{\mathbf{x}_{\text{label}}}}{\sum_{j=1}^N e^{\mathbf{x}_j}} \\ &= \boldsymbol{w}_{\text{label}} \left[ -\mathbf{x}_{\text{label}} + \log \sum_{j=1}^N e^{\mathbf{x}_j} \right] \end{align*}$

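Here $\mathbf{x} \in \mathbb{R}^N$ holds the raw activations before Softmax, and $N$ is the dimension of $\mathbf{x}$ (the feature dimension), which equals the number of classes $C$. $\text{label} \in [0, C-1]$ is a scalar, the corresponding class label. The two have different shapes: the target is a class index and does not need to be one-hot encoded, while the output has one score per class. $\boldsymbol{w} \in \mathbb{R}^C$ is a vector of size $C$ holding the per-class weights; classes with few samples can be given a larger weight.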

weight = torch.Tensor([1,2,1,1,10])
loss_fn = torch.nn.CrossEntropyLoss(reduce=False, size_average=False, weight=weight)
input = Variable(torch.randn(3, 5)) # (batch_size, C)
target = Variable(torch.LongTensor(3).random_(5)) # class indices must be a LongTensor
loss = loss_fn(input, target)
print(input); print(target); print(loss)

tensor([[ 0.1040,  0.0172,  0.2849, -0.7865, -1.5540],
        [ 1.6831, -1.5757, -0.4303,  1.0627,  0.0863],
        [-1.0147, -1.5500,  0.1911, -1.8070,  0.7058]])
tensor([3, 2, 4])
tensor([2.2031, 2.7551, 6.7425])
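The weighted formula can be checked against the three losses printed above. This is a plain-Python sketch of the formula, not PyTorch's implementation; `cross_entropy` is a hypothetical helper name, and since the inputs are the rounded printed values the results agree only to about three decimals.

```python
import math


def cross_entropy(x, label, weight):
    """w[label] * (-x[label] + log(sum_j exp(x[j]))) for a single sample."""
    log_sum_exp = math.log(sum(math.exp(v) for v in x))
    return weight[label] * (-x[label] + log_sum_exp)


weight = [1, 2, 1, 1, 10]
inputs = [
    [0.1040, 0.0172, 0.2849, -0.7865, -1.5540],
    [1.6831, -1.5757, -0.4303, 1.0627, 0.0863],
    [-1.0147, -1.5500, 0.1911, -1.8070, 0.7058],
]
targets = [3, 2, 4]

# the last sample has label 4, so its loss is scaled by weight 10
for x, t in zip(inputs, targets):
    print(cross_entropy(x, t, weight))
```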

nn.NLLLoss

Negative log likelihood loss for multi-class classification.

$\text{loss}(\mathbf{x}, \text{label}) = - \mathbf{x}_{\text{label}}$

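Placing an nn.LogSoftmax layer in front of it makes it equivalent to the cross entropy loss. In fact, nn.CrossEntropyLoss also calls this function. Note that $\mathbf{x}_{\text{label}}$ here differs from the one in the cross entropy loss above (even though I used the same symbol): here it is the value after the $\text{LogSoftmax}$ operation.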
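The equivalence can be illustrated with a small plain-Python sketch: applying a log-softmax and then picking out the negated entry at the label gives the same number as the direct cross entropy formula. The helper names `log_softmax` and `nll_loss` are hypothetical stand-ins for the PyTorch layers, and the example input is arbitrary.

```python
import math


def log_softmax(x):
    """x[j] - log(sum_k exp(x[k])) for every j."""
    log_sum_exp = math.log(sum(math.exp(v) for v in x))
    return [v - log_sum_exp for v in x]


def nll_loss(log_probs, label):
    """Negative log likelihood: just the negated entry at the label."""
    return -log_probs[label]


x = [0.1040, 0.0172, 0.2849, -0.7865, -1.5540]
label = 3

via_nll = nll_loss(log_softmax(x), label)
direct = -x[label] + math.log(sum(math.exp(v) for v in x))
print(via_nll, direct)  # the two values coincide
```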

nn.NLLLoss2d

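Similar to the above but with a few extra dimensions, typically used for images. In current PyTorch versions it has been merged into the function above.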

input, (N, C, H, W)
target, (N, H, W)

For example, when a fully convolutional network is used for semantic segmentation, a class label is predicted for every pixel of the final image.

nn.KLDivLoss

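KL divergence, also called relative entropy, measures how far apart two distributions are; the more similar they are, the closer it is to zero.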

$\text{loss}(\mathbf{x}, \mathbf{y}) = \frac1N \sum_{i=1}^N [\mathbf{y}_i * (\log \mathbf{y}_i - \mathbf{x}_i)]$

Note that $\mathbf{x}_i$ here is a log-probability; at first I thought the API had a mistake.
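A plain-Python sketch of the formula makes the log-probability convention concrete: the first argument must already be the log of the predicted distribution. `kl_div` is a hypothetical helper name, and treating terms with $\mathbf{y}_i = 0$ as zero is an assumption of this sketch.

```python
import math


def kl_div(x_log, y):
    """Mean over elements of y_i * (log y_i - x_i); x_log holds log-probabilities."""
    terms = [
        yi * (math.log(yi) - xi) if yi > 0 else 0.0  # 0 * log 0 treated as 0
        for xi, yi in zip(x_log, y)
    ]
    return sum(terms) / len(y)


target = [0.5, 0.5]
predicted = [0.25, 0.75]

# pass log-probabilities, not probabilities, as the first argument
print(kl_div([math.log(q) for q in predicted], target))
# identical distributions give zero
print(kl_div([math.log(q) for q in target], target))
```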

nn.MarginRankingLoss

A loss for evaluating similarity through ranking.

$\text{loss}(x_1, x_2, y) = \max(0, -y * (x_1 - x_2) + \text{margin})$

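All three are scalars. y can only be 1 or -1: 1 means x1 should be larger than x2, and -1 means x2 should be larger. The margin parameter requires the two values to be at least margin apart in the right order; otherwise the loss is positive. The default margin is zero.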
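The behavior for the two values of y can be seen in a short plain-Python sketch of the formula (`margin_ranking` is a hypothetical helper name, not the PyTorch API):

```python
def margin_ranking(x1, x2, y, margin=0.0):
    """max(0, -y * (x1 - x2) + margin) for scalars x1, x2 and y in {1, -1}."""
    return max(0.0, -y * (x1 - x2) + margin)


print(margin_ranking(2.0, 1.0, 1))              # correctly ordered: no loss
print(margin_ranking(1.0, 2.0, 1))              # wrongly ordered: positive loss
print(margin_ranking(1.2, 1.0, 1, margin=0.5))  # ordered, but closer than margin
```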

nn.MultiMarginLoss

Hinge loss for multi-class classification,

$\text{loss}(\mathbf{x}, y) = \frac1N \sum_{i=1, i\ne y}^N \max(0, \text{margin} - \mathbf{x}_y + \mathbf{x}_i)^p$

where $1 \le y \le N$ is the label, $p$ defaults to 1, and margin defaults to 1; other values can also be used. See the derivation of the SVM loss in the cs231n assignments.
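A plain-Python sketch of this loss for a single sample follows. It assumes 0-based labels (as in the actual Python interface) and that the power $p$ is applied after the max, as in the PyTorch documentation; `multi_margin` is a hypothetical helper name.

```python
def multi_margin(x, y, p=1, margin=1.0):
    """Multi-class hinge loss for one sample; y is a 0-based class index."""
    total = 0.0
    for i, xi in enumerate(x):
        if i == y:
            continue  # the true class is skipped
        total += max(0.0, margin - x[y] + xi) ** p
    return total / len(x)


x = [0.1, 0.2, 0.4, 0.8]
# (0.3 + 0.4 + 0.6) / 4
print(multi_margin(x, 3))
# squared hinge: (0.09 + 0.16 + 0.36) / 4
print(multi_margin(x, 3, p=2))
```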

nn.MultiLabelMarginLoss

Hinge loss for multi-class, multi-label classification; it extends MultiMarginLoss above to the multi-label setting, with p = 1 and margin = 1 fixed.

$\text{loss}(\mathbf{x}, \mathbf{y}) = \frac1N \sum_{i=1,i \ne \mathbf{y}_j}^N \sum_{j=1}^{\mathbf{y}_j \ne 0} [\max(0, 1 - (\mathbf{x}_{\mathbf{y}_j} - \mathbf{x}_i))]$

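This interface is a bit of a trap: it was copied directly from Torch, see the description of MultiLabelMarginCriterion. Lua arrays are indexed from 1 rather than from 0 as in Python, so 0 is used as the placeholder. There are a few pitfalls to note:

- Here $\mathbf{x}, \mathbf{y}$ are both vectors of size $N$. If $y$ is a scalar rather than a vector, the $\sum_j$ disappears and the loss reduces to MultiMarginLoss above.
- $\mathbf y$ is fixed to size $N$ in order to handle samples with different numbers of labels. 0 acts as a placeholder: that position and every number after it are treated as not being correct classes. For example, with $\mathbf y=[5,3,0,0,4]$ the sample is considered to belong to classes 5 and 3, while 4 is ignored because it comes after a zero.
- The formula and explanation above only follow the documentation. When actually calling the interface, -1 is the placeholder and 0 is the first class.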

An example:

import torch
loss = torch.nn.MultiLabelMarginLoss()
x = torch.autograd.Variable(torch.FloatTensor([[0.1, 0.2, 0.4, 0.8]]))
y = torch.autograd.Variable(torch.LongTensor([[3, 0, -1, 1]]))
print(loss(x, y)) # will give 0.8500

Following the reasoning above, classes 3 and 0 are the correct ones and classes 1 and 2 are not, so

$\begin{align*} \text{loss} &= \frac14\sum_{i=1,2}\sum_{j=3,0} [\max(0, 1 - (x_j - x_i))] \\ &= \frac14 [(1-(0.8-0.2)) + (1-(0.1-0.2)) + (1-(0.8-0.4)) + (1-(0.1-0.4))] \\ &= \frac14 [0.4 + 1.1 + 0.6 + 1.3] = 0.85 \end{align*}$

Note that in the second line of the derivation I omitted the max(0, x) for brevity.
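The same computation can be written as a plain-Python sketch that follows the conventions described above: labels are 0-based and -1 ends the list of correct classes. `multilabel_margin` is a hypothetical helper name, not the PyTorch implementation.

```python
def multilabel_margin(x, y):
    """Multi-label hinge loss for one sample; labels after the first -1 are ignored."""
    targets = []
    for label in y:
        if label < 0:
            break  # -1 is the placeholder: stop reading labels here
        targets.append(label)

    total = 0.0
    for j in targets:                 # every correct class
        for i in range(len(x)):       # against every incorrect class
            if i not in targets:
                total += max(0.0, 1.0 - (x[j] - x[i]))
    return total / len(x)


# same data as the example above: classes 3 and 0 are correct, the trailing 1 is ignored
print(multilabel_margin([0.1, 0.2, 0.4, 0.8], [3, 0, -1, 1]))
```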

nn.SoftMarginLoss

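A multi-label binary classification problem: each of the $N$ items is a binary classification, so this is simply the sum of $N$ binary losses, simplified. Each entry of $\mathbf{y}$ can only be $1$ or $-1$, for the positive and negative class. It is in fact equivalent to the loss below; only the encoding of $\mathbf{y}$ differs.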

$\text{loss}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^N \log(1+e^{-\mathbf{y}_i \mathbf{x}_i})$

nn.MultiLabelSoftMarginLoss

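The multi-label version of the above: a multi-label one-versus-all loss based on max-entropy, where each entry of $\mathbf{y}$ can only be 1 or 0, for the positive and negative class.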

$\text{loss}(\mathbf{x}, \mathbf{y}) = -\sum_{i=1}^N\left[ \mathbf{y}_i\log\frac{e^{\mathbf{x}_i}}{1+e^{\mathbf{x}_i}} + (1- \mathbf{y}_i)\log\frac{1}{1+e^{\mathbf{x}_i}}\right]$
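The claimed equivalence between SoftMarginLoss and MultiLabelSoftMarginLoss can be checked numerically. The sketch below implements both formulas exactly as written in this post (plain sums, without any averaging) and maps labels from {1, -1} to {1, 0}; the function names are hypothetical and the input values are arbitrary.

```python
import math


def soft_margin(x, y):
    """sum_i log(1 + exp(-y_i * x_i)), with y_i in {1, -1}."""
    return sum(math.log(1 + math.exp(-yi * xi)) for xi, yi in zip(x, y))


def multilabel_soft_margin(x, t):
    """Sum of per-item sigmoid cross entropies, with t_i in {1, 0}."""
    total = 0.0
    for xi, ti in zip(x, t):
        log_p = math.log(math.exp(xi) / (1 + math.exp(xi)))
        log_not_p = math.log(1 / (1 + math.exp(xi)))
        total -= ti * log_p + (1 - ti) * log_not_p
    return total


x = [0.3, -1.2, 2.0]
y = [1, -1, 1]
t = [(yi + 1) // 2 for yi in y]  # map -1 -> 0 and 1 -> 1

print(soft_margin(x, y), multilabel_soft_margin(x, t))  # same value
```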

nn.CosineEmbeddingLoss

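A loss based on cosine similarity. For $y = 1$ the goal is to make the two vectors as close as possible; for $y = -1$ they are pushed apart. Note that both vectors carry gradients.

$\text{loss}(\mathbf{x}, \mathbf{y}) = \left\{\begin{aligned} &1 - \cos(\mathbf{x}, \mathbf{y}) &\text{if } &y == 1 \\ &\max(0, \cos(\mathbf{x}, \mathbf{y}) - \text{margin}) &\text{if }&y == -1 \\ \end{aligned}\right.$

margin can be anywhere in $[-1, 1]$, but a value between 0 and 0.5 is recommended.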
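A plain-Python sketch of the two cases follows. It assumes the dissimilar case is max(0, cos - margin), as in the PyTorch documentation; `cosine_embedding` is a hypothetical helper name.

```python
import math


def cosine_embedding(x, y, label, margin=0.0):
    """Cosine embedding loss for one pair of vectors; label is 1 or -1."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = dot / norm
    if label == 1:
        return 1.0 - cos              # similar pair: reward high cosine
    return max(0.0, cos - margin)     # dissimilar pair: penalize cosine above margin


print(cosine_embedding([1.0, 0.0], [1.0, 0.0], 1))    # identical, labeled similar
print(cosine_embedding([1.0, 0.0], [1.0, 0.0], -1))   # identical, labeled dissimilar
print(cosine_embedding([1.0, 0.0], [0.0, 1.0], -1))   # orthogonal, labeled dissimilar
```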

nn.HingeEmbeddingLoss

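I am not sure what this one is typically used for; the PyTorch documentation describes it as measuring whether two inputs are similar or dissimilar, for example with $\mathbf{x}$ being a pairwise distance. Also, the documentation has a mistake: $\mathbf{x}$ and $\mathbf{y}$ should have the same shape.

$\text{loss}(\mathbf{x}, \mathbf{y}) = \frac1N \sum_{i=1}^N \left\{\begin{aligned} & \mathbf{x}_i & \text{if } &\mathbf{y}_i == 1 & \\ & \max(0, \text{margin} - \mathbf{x}_i) &\text{if } &\mathbf{y}_i == -1 & \end{aligned}\right.$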
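Read as an average of per-element terms, the formula can be sketched in plain Python as below. `hinge_embedding` is a hypothetical helper name, the default margin of 1 is an assumption of this sketch, and the inputs are treated as distances purely for illustration.

```python
def hinge_embedding(x, y, margin=1.0):
    """Average of x_i where y_i == 1 and max(0, margin - x_i) where y_i == -1."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi if yi == 1 else max(0.0, margin - xi)
    return total / len(x)


# a small distance for a similar pair, then two dissimilar pairs
# (one already farther apart than the margin, one not)
print(hinge_embedding([0.2, 1.5, 0.4], [1, -1, -1]))
```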

nn.TripletMarginLoss

$L(\mathbf{a},\mathbf{p},\mathbf{n}) = \frac1N \left (\sum_{i=1}^N\max(0,\ d(\mathbf{a}_i, \mathbf{p}_i) - d(\mathbf{a}_i, \mathbf{n}_i) + \text{margin}) \right )$

where $d(\mathbf{x}_i, \mathbf{y}_i) = \lVert \mathbf{x}_i - \mathbf{y}_i \rVert_p$ is the pairwise distance, with $p = 2$ by default.
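A plain-Python sketch of the triplet loss over a small batch follows. It assumes $d$ is the (unsquared) Euclidean distance, which is my understanding of the current PyTorch default; the helper names `euclidean` and `triplet_margin` are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin(anchors, positives, negatives, margin=1.0):
    """Mean over the batch of max(0, d(a, p) - d(a, n) + margin)."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
    return total / len(anchors)


anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[0.0, 1.0], [0.0, 1.0]]
negatives = [[3.0, 4.0], [0.0, 1.5]]

# first triplet: the negative is far enough away, so it contributes nothing;
# second triplet: the negative is only 0.5 farther than the positive
print(triplet_margin(anchors, positives, negatives))
```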

References

pytorch loss function 总结 (a summary of PyTorch loss functions)