Analyzing Neural Networks from the Manifold Viewpoint

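I came across a book that analyzes neural networks mathematically, drawing on a range of tools (mathematical analysis, linear algebra, manifolds, information theory, probability theory, optimization, and so on). The book is: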

Ovidiu Calin, Deep Learning Architectures - A Mathematical Approach, Springer, 2020.

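I read the first few pages of the chapter that explains neural networks in terms of manifolds and found it well written, so here are some notes.
Consider a single neuron with input $\textbf{x}\in \mathbb{R}^n$ and output $y=\sigma(w^T\textbf{x}+b )\in \mathbb{R}$ . Take $\sigma$ to be the logistic function. Then the set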
$S=\{\sigma(w^T\textbf{x}+b );w\in\mathbb{R}^n,b\in \mathbb{R} \}$
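is an $(n + 1)$ -dimensional manifold, parametrized by $(w,b)$ . It can be viewed as a submanifold of the space of all continuous functions on $\mathbb{R}^n$ (which is infinite-dimensional). Indeed, a direct computation gives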
$\frac{\partial{y}}{\partial{b}}=\sigma'(w^T\textbf{x}+b )=y(1-y) \\ \frac{\partial{y}}{\partial{w_j}}=\sigma'(w^T\textbf{x}+b )x_j=y(1-y)x_j$
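We show that $\{\frac{\partial{y}}{\partial{b}},\frac{\partial{y}}{\partial{w_j}}\}$ are linearly independent as functions of $\textbf{x}$ . Suppose $\alpha_0\frac{\partial{y}}{\partial{b}}+\sum_{i=1}^{n}\alpha_i\frac{\partial{y}}{\partial{w_i}}=0$ . Substituting gives $\alpha_0y(1-y) +\sum_{i=1}^{n}\alpha_iy(1-y) x_i=0$ , and since $y(1-y)\neq0$ we get $\alpha_0 +\sum_{i=1}^{n}\alpha_i x_i=0$ for every $\textbf{x}\in\mathbb{R}^n$ . Taking $\textbf{x}=0$ gives $\alpha_0=0$ , and then taking $\textbf{x}$ to be the $i$ -th standard basis vector gives $\alpha_i=0$ , which proves the claim. Hence the Jacobian matrix $J_y$ has full rank (namely $n + 1$ ).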
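The full-rank claim can be checked numerically. The following is a minimal sketch, not taken from the book: it evaluates the $n+1$ partial derivatives at a finite set of random sample inputs and checks that the resulting matrix has rank $n+1$. The helper name `tangent_functions`, the sample size, and the random parameters are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def tangent_functions(w, b, xs):
    """Evaluate dy/db and dy/dw_j at the sample inputs xs (shape (m, n)).

    Returns an (m, n+1) matrix: column 0 holds dy/db = y(1-y),
    column j holds dy/dw_j = y(1-y) x_j.
    """
    y = sigmoid(xs @ w + b)
    s = y * (1.0 - y)
    return np.column_stack([s, s[:, None] * xs])

# Hypothetical test setup: n = 3 inputs, 50 random sample points.
rng = np.random.default_rng(0)
n = 3
w = rng.normal(size=n)
b = 0.5
xs = rng.normal(size=(50, n))

T = tangent_functions(w, b, xs)
rank = np.linalg.matrix_rank(T)
print("rank of sampled tangent functions:", rank, "expected:", n + 1)
```

A finite sample can only confirm independence, not refute it in general: if the sampled columns are linearly independent, the underlying functions are too.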
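Next, training a neural network amounts to fitting a function $z=z(\textbf{x})$ . If $z$ lies on the manifold $S$ , then there exist $w^*\in\mathbb{R}^n,b^*\in \mathbb{R}$ such that $z=y^*=y(w^*,b^*)$ . More generally, however, $z\notin S$ , which means we need to find the point of $S$ closest to $z$ , that is, $w^*\in\mathbb{R}^n,b^*\in \mathbb{R}$ such that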
$(w^*,b^*)=\mathop{argmin}\limits_{w,b} dist(z,y(w,b))$
Given an initial value $(w_0,b_0)$ , a learning algorithm produces a sequence $(w_n,b_n)_n$ that we hope converges to $(w^*,b^*)$ . In the author's words: If the parameters update is made continuously (implied by an infinitesimal learning rate), then we obtain a curve $c (t) = (w (t), b (t))$ joining $(w_0,b_0)$ and $(w^*,b^*)$ . This can be lifted to the curve $y\circ c(t)$ on the manifold $S$ . The fastest learning algorithm corresponds to the “shortest” curve between $y(w_0,b_0)$ and $y(w^*,b^*)$ . The attribute “shortest” depends on the intrinsic geometry of the manifold $S$ , and this topic will be discussed in the next section. In this way the optimization problem is connected to the notions of Riemannian metric and geodesic that appear later.
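The sequence $(w_n,b_n)_n$ can be illustrated with plain gradient descent. This is only a sketch under assumptions that are not in the book: the input is one-dimensional, the target is $z(x)=x^2$ on $[0,1]$ , the distance is replaced by the mean squared error on a grid, and the learning rate and step count are arbitrary. The recorded `path` is a discrete stand-in for the curve $c(t)$ .

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron(w, b, xs):
    """Single neuron y = sigma(w x + b) for scalar inputs."""
    return sigmoid(w * xs + b)

def loss(w, b, xs, z):
    """Mean squared error on the grid (assumed stand-in for dist)."""
    return np.mean((neuron(w, b, xs) - z) ** 2)

def gradient_descent(w, b, xs, z, lr=0.1, steps=5000):
    """Return the list of parameter pairs visited, starting at (w, b)."""
    path = [(w, b)]
    for _ in range(steps):
        y = neuron(w, b, xs)
        # Chain rule: d(loss)/dy times dy/d(pre-activation) = y(1-y).
        r = 2.0 * (y - z) * y * (1.0 - y)
        w, b = w - lr * np.mean(r * xs), b - lr * np.mean(r)
        path.append((w, b))
    return path

xs = np.linspace(0.0, 1.0, 101)
z = xs ** 2                      # hypothetical target, not on S
path = gradient_descent(0.0, 0.0, xs, z)

initial_loss = loss(*path[0], xs, z)
final_loss = loss(*path[-1], xs, z)
print("loss at (w_0, b_0):", initial_loss)
print("loss at last iterate:", final_loss)
```

Applying `neuron` to each pair in `path` gives the corresponding points of the lifted curve on $S$ , sampled on the grid. Gradient descent in the Euclidean parameter space does not in general follow the shortest curve on $S$ .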
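For a general neural network, increasing the number of neurons increases the number of parameters, and the dimension of $S$ grows accordingly. Let $M = C ([0, 1])$ . It is known that for any fixed $\epsilon>0$ and any $f\in M$ , there is an $S$ of sufficiently high dimension such that $dist(f,S)<\epsilon$ , where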
$dist(f,S)=\mathop{inf}\limits_{s\in S}\mathop{max}\limits_{x\in [0,1]}|f(x)-s(x)|$
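Since $dist(f,S)$ is an infimum over $S$ , any particular $s\in S$ gives an upper bound on it. The sketch below computes such bounds on a grid for networks of increasing width. Everything specific here is an assumption for illustration: the target $f$ , the one-hidden-layer architecture with a linear output layer, the random hidden weights, and the least-squares fit of the output weights. The maximum is taken over grid points only, so it approximates the maximum over $[0,1]$ .

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def sup_distance(f_vals, s_vals):
    """max |f(x) - s(x)| over the grid points."""
    return float(np.max(np.abs(f_vals - s_vals)))

def upper_bound_on_dist(f_vals, grid, hidden, rng):
    """Fit one candidate s with `hidden` neurons and return its sup distance to f.

    Hidden weights are random and fixed; only the linear output layer
    (including a constant term) is fitted by least squares.
    """
    w = rng.normal(scale=10.0, size=hidden)
    b = rng.normal(scale=10.0, size=hidden)
    H = sigmoid(np.outer(grid, w) + b)            # shape (m, hidden)
    H = np.column_stack([np.ones_like(grid), H])  # add constant feature
    coef = np.linalg.lstsq(H, f_vals, rcond=None)[0]
    return sup_distance(f_vals, H @ coef)

grid = np.linspace(0.0, 1.0, 201)
f_vals = np.sin(2.0 * np.pi * grid)   # hypothetical target f in C([0, 1])

rng = np.random.default_rng(0)
bounds = {h: upper_bound_on_dist(f_vals, grid, h, rng) for h in (1, 4, 16, 64)}
for h, d in bounds.items():
    print(f"hidden neurons: {h:3d}  sup distance of fitted s: {d:.4f}")
```

The least-squares fit minimizes the squared error rather than the maximum error, so these values bound $dist(f,S)$ from above without attaining it.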
In practice, however, the number of neurons is limited, and how to deal with this is another topic the author discusses.