Mathematical Definition of Derivatives and Partial Derivatives
Derivatives and partial derivatives are mathematical tools for describing the rate of change of a function with respect to its independent variables. Strictly speaking, a derivative or partial derivative can only be taken with respect to an independent variable; differentiating with respect to anything else is not well defined. In machine learning, however, and especially in deep learning frameworks such as PyTorch, one frequently encounters "taking the derivative of a scalar with respect to a vector".
An Example of Differentiating a Scalar with Respect to a Vector
Consider the following PyTorch code:
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 + 2          # element-wise: y_i = x_i^2 + 2
z = torch.sum(y)        # scalar: z = x_1^2 + x_2^2 + x_3^2 + 6
z.backward()            # fills x.grad with dz/dx
print(x.grad)           # prints tensor([2., 4., 6.])
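As an aside that is not part of the original example, the same gradient can also be obtained without calling backward at all, through the functional interface torch.autograd.grad; a minimal self-contained sketch:
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.sum(x ** 2 + 2)
grad_x, = torch.autograd.grad(z, x)   # returns a tuple with one gradient per input
print(grad_x)                         # tensor([2., 4., 6.])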
Let \(x = [x_1, x_2, x_3]\). Then:
\[ z = x_1^2 + x_2^2 + x_3^2 + 6 \]
Taking the partial derivative of \(z\) with respect to each of \(x_1, x_2, x_3\):
\[ \frac{\partial z}{\partial x_1} = 2x_1 \]
\[ \frac{\partial z}{\partial x_2} = 2x_2 \]
\[ \frac{\partial z}{\partial x_3} = 2x_3 \]
Substituting \(x_1 = 1.0\), \(x_2 = 2.0\), \(x_3 = 3.0\) gives:
\[
\left( \frac{\partial z}{\partial x_1}, \frac{\partial z}{\partial x_2}, \frac{\partial z}{\partial x_3} \right) = (2.0, 4.0, 6.0)
\]
This agrees with PyTorch's output. In other words, the so-called "derivative of a scalar with respect to a vector" is really just the collection of partial derivatives of the function with respect to each independent variable, with those variables packaged into a vector, so it does not conflict with the mathematical definition at all.
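Note, however, that backward() may be called without arguments only when the tensor is a scalar. Calling it on a vector output without supplying a gradient raises an error, which is exactly where the gradient parameter discussed in the next section comes in. A minimal illustration (the exact error message may differ between PyTorch versions):
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 + 2
y.backward()   # RuntimeError: grad can be implicitly created only for scalar outputs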
The Role of the gradient Parameter of the backward Function
Consider the following problem.
Given:
\[ y_1 = x_1 \cdot x_2 \cdot x_3 \]
\[ y_2 = x_1 + x_2 + x_3 \]
\[ y_3 = x_1 + x_2 \cdot x_3 \]
\[ A = f(y_1, y_2, y_3) \]
where the concrete form of the function \(f(y_1, y_2, y_3)\) is unknown, find:
\[ \frac{\partial A}{\partial x_1} = ? \]
\[ \frac{\partial A}{\partial x_2} = ? \]
\[ \frac{\partial A}{\partial x_3} = ? \]
By the chain rule for multivariable composite functions:
\[ \frac{\partial A}{\partial x_1} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_1} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_1} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_1} \]
\[ \frac{\partial A}{\partial x_2} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_2} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_2} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_2} \]
\[ \frac{\partial A}{\partial x_3} = \frac{\partial A}{\partial y_1} \frac{\partial y_1}{\partial x_3} + \frac{\partial A}{\partial y_2} \frac{\partial y_2}{\partial x_3} + \frac{\partial A}{\partial y_3} \frac{\partial y_3}{\partial x_3} \]
These equations can be written as a matrix product:
\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right] \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}
\]
Here, the matrix
\[
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}
\]
is called the Jacobian matrix. The Jacobian can be computed directly from the given expressions for \(y_1, y_2, y_3\).
As long as the values of \(\left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right]\) are known, we can compute \(\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right]\) even without knowing the concrete form of \(f(y_1, y_2, y_3)\).
In PyTorch, the gradient argument of the backward function is precisely how the values of \(\left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right]\) are supplied. By passing gradient, we can obtain the partial derivatives with respect to each independent variable directly, without knowing the concrete form of the functions that sit between that point of the computation and the final output.
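Incidentally, when the output is a scalar the gradient argument may be omitted, in which case PyTorch implicitly uses a gradient of 1.0. The following quick check (an aside reusing the first example, not taken from the original text) shows the two calls are equivalent:
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.sum(x ** 2 + 2)
z.backward(torch.tensor(1.0))   # same result as z.backward()
print(x.grad)                   # tensor([2., 4., 6.])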
Code Example
# coding: utf-8
import torch
x1 = torch.tensor(1, requires_grad=True, dtype=torch.float)
x2 = torch.tensor(2, requires_grad=True, dtype=torch.float)
x3 = torch.tensor(3, requires_grad=True, dtype=torch.float)
y = torch.randn(3)            # placeholder; all three entries are overwritten below
y[0] = x1 * x2 * x3
y[1] = x1 + x2 + x3
y[2] = x1 + x2 * x3
# Pass [dA/dy1, dA/dy2, dA/dy3] as the gradient argument of backward.
y.backward(torch.tensor([0.1, 0.2, 0.3], dtype=torch.float))
print(x1.grad)                # tensor(1.1000)
print(x2.grad)                # tensor(1.4000)
print(x3.grad)                # tensor(1.)
Following the derivation above:
\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ \frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3} \right] \begin{bmatrix}
x_2 \cdot x_3 & x_1 \cdot x_3 & x_1 \cdot x_2 \\
1 & 1 & 1 \\
1 & x_3 & x_2
\end{bmatrix}
\]
Substituting \(x_1 = 1\), \(x_2 = 2\), \(x_3 = 3\) and the gradient \(\left[ 0.1, 0.2, 0.3 \right]\):
\[
\left[ \frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \right] = \left[ 0.1, 0.2, 0.3 \right] \begin{bmatrix}
6 & 3 & 2 \\
1 & 1 & 1 \\
1 & 3 & 2
\end{bmatrix} = [1.1, 1.4, 1.0]
\]
This agrees with the output of the code.
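As a numerical cross-check of the vector-Jacobian interpretation, the following self-contained sketch (not part of the original example; it assumes torch.autograd.functional.jacobian, available in PyTorch 1.5 and later) builds the Jacobian explicitly and multiplies the supplied gradient by it:
import torch
from torch.autograd.functional import jacobian
def f(x):
    # The same three outputs as above, written as a function of one input vector.
    return torch.stack([x[0] * x[1] * x[2],
                        x[0] + x[1] + x[2],
                        x[0] + x[1] * x[2]])
x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)                  # 3x3 matrix with entries dy_i / dx_j
v = torch.tensor([0.1, 0.2, 0.3])   # the vector passed to backward as gradient
print(v @ J)                        # tensor([1.1000, 1.4000, 1.0000])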