
How backward() Computes Gradients


陆九九 2024-10-06




The Mathematical Definition of Derivatives and Partial Derivatives

Derivatives and partial derivatives are both mathematical tools for describing how a function changes with respect to its independent variables. Strictly by the mathematical definition, a derivative or partial derivative can only be taken with respect to an independent variable; differentiating with respect to anything else is not well defined. In machine learning, however, and especially in deep learning frameworks such as PyTorch, we routinely speak of differentiating a scalar with respect to a vector.

An Example of Differentiating a Scalar with Respect to a Vector

Consider the following PyTorch code:


import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 + 2      # element-wise: y_i = x_i^2 + 2
z = torch.sum(y)    # scalar: z = x_1^2 + x_2^2 + x_3^2 + 6
z.backward()        # compute dz/dx and accumulate it into x.grad
print(x.grad)       # tensor([2., 4., 6.])

Let \(x = [x_1, x_2, x_3]\); then:

\[ z = x_1^2 + x_2^2 + x_3^2 + 6 \]

Taking the partial derivative of \(z\) with respect to each of \(x_1, x_2, x_3\):

\[ \frac{\partial z}{\partial x_1} = 2x_1 \]
\[ \frac{\partial z}{\partial x_2} = 2x_2 \]
\[ \frac{\partial z}{\partial x_3} = 2x_3 \]

Substituting \(x_1 = 1.0, x_2 = 2.0, x_3 = 3.0\) gives:

\[
\left( \frac{\partial z}{\partial x_1}, \frac{\partial z}{\partial x_2}, \frac{\partial z}{\partial x_3} \right) = (2.0, 4.0, 6.0)
\]

This agrees with PyTorch's output. In fact, the so-called "derivative of a scalar with respect to a vector" is simply the function's partial derivative with respect to each independent variable, with those variables collected into a vector; it does not contradict the mathematical definition at all.
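
One detail worth noting before the next section: for a scalar output such as z, calling backward() with no argument is equivalent to passing an upstream gradient of 1.0 explicitly. A minimal sketch of this, which previews the gradient argument discussed next:


import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.sum(x ** 2 + 2)

# For a scalar output, backward() defaults the upstream gradient to 1.0;
# passing it explicitly gives the same result as z.backward().
z.backward(gradient=torch.tensor(1.0))
print(x.grad)   # tensor([2., 4., 6.])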

The Role of the gradient Parameter in the backward Function

Consider the following problem.

Given:

\[ y_1 = x_1 \cdot x_2 \cdot x_3 \]
\[ y_2 = x_1 + x_2 + x_3 \]
\[ y_3 = x_1 + x_2 \cdot x_3 \]
\[ A = f(y_1, y_2, y_3) \]

where the explicit form of \(f(y_1, y_2, y_3)\) is unknown, find:

\[ \frac{\partial A}{\partial x_1} = ? \]
\[ \frac{\partial A}{\partial x_2} = ? \]
\[ \frac{\partial A}{\partial x_3} = ? \]

By the chain rule for multivariable composite functions:

\[ \frac{\partial A}{\partial x_1} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_1} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_1} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_1} \]
\[ \frac{\partial A}{\partial x_2} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_2} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_2} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_2} \]
\[ \frac{\partial A}{\partial x_3} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_3} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_3} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_3} \]

These equations can be written as a matrix product:

\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right] \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}
\]

Here, the matrix

\[
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}
\]

is called the Jacobian matrix. It can be computed from the relations given above.
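
Concretely, for the \(y_1, y_2, y_3\) defined above, the entries follow directly from the definitions:

\[
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}
=
\begin{bmatrix}
x_2 \cdot x_3 & x_1 \cdot x_3 & x_1 \cdot x_2 \\
1 & 1 & 1 \\
1 & x_3 & x_2
\end{bmatrix}
\]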

As long as the values of \(\left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right]\) are known, we can compute \(\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right]\) even without knowing the explicit form of \(f(y_1, y_2, y_3)\).

In PyTorch, the gradient argument of the backward function is exactly what supplies the values of \(\left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right]\). By passing gradient, we hand the computation an upstream gradient directly and obtain the partial derivatives with respect to each independent variable, without knowing the explicit form of everything upstream of that point in the composition (here, the unknown function \(f\)).

Code Example


# coding: utf-8
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
x3 = torch.tensor(3.0, requires_grad=True)

# Assemble the vector output y = [y1, y2, y3]
y = torch.stack([x1 * x2 * x3,    # y1 = x1 * x2 * x3
                 x1 + x2 + x3,    # y2 = x1 + x2 + x3
                 x1 + x2 * x3])   # y3 = x1 + x2 * x3

# Supply [dA/dy1, dA/dy2, dA/dy3] through the gradient argument
y.backward(torch.tensor([0.1, 0.2, 0.3]))

print(x1.grad)   # tensor(1.1000)
print(x2.grad)   # tensor(1.4000)
print(x3.grad)   # tensor(1.)

Following the derivation above:

\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right] \begin{bmatrix}
x_2 \cdot x_3 & x_1 \cdot x_3 & x_1 \cdot x_2 \\
1 & 1 & 1 \\
1 & x_3 & x_2
\end{bmatrix}
\]

Substituting \(x_1 = 1, x_2 = 2, x_3 = 3\) and the gradient \([0.1, 0.2, 0.3]\):

\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ 0.1, 0.2, 0.3 \right] \begin{bmatrix}
6 & 3 & 2 \\
1 & 1 & 1 \\
1 & 3 & 2
\end{bmatrix} = [1.1, 1.4, 1.0]
\]

This agrees with the output of the code.
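
As a cross-check, the same vector-Jacobian product can be evaluated by building the Jacobian explicitly. The sketch below assumes a PyTorch version that provides torch.autograd.functional.jacobian; the names f and v are introduced here only for illustration.


import torch
from torch.autograd.functional import jacobian

def f(x):
    # x = [x1, x2, x3]; returns y = [y1, y2, y3] as defined above
    return torch.stack([x[0] * x[1] * x[2],
                        x[0] + x[1] + x[2],
                        x[0] + x[1] * x[2]])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)                   # 3x3 matrix of dy_i/dx_j
v = torch.tensor([0.1, 0.2, 0.3])    # the upstream gradient [dA/dy1, dA/dy2, dA/dy3]
print(v @ J)                         # tensor([1.1000, 1.4000, 1.0000])


For completeness, torch.autograd.grad exposes the same mechanism through its grad_outputs argument, which plays the same role as the gradient argument of backward but returns the gradients instead of accumulating them into .grad.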

