Katyusha
\(Katyusha\)算是\(Gradient-Mirror Linear Coupling\)的扩展。Linear Coupling是\(GradientDescent\)和\(MirrorDescent\)的线性组合,达到了Nesterov加速的效果。但Linear Coupling将\(GD\)换成\(SGD\)的时候效果并不好,于是Zeyuan为\(SGD\)重新设计一个加速的方法,它依旧是以线性耦合的思想,并利用\(SVRG\)更好帮助其加速。
1. Momentum
\[v_{k+1} = \gamma v_{k} + g_k\] \[x_{k+1} = x_k - v_{k+1}\]其中\(\gamma\)为动量常数,\(g\)为步长,普通的momentum中\(g_k=\eta \triangledown f(x_k)\)计算当前梯度,Nesterov’s Momentum中\(g_k= \eta \triangledown f(x_k-\gamma v_{k})\)计算向前一步后的梯度,\(\eta\)为学习率。momentum和Nesterov’s Momentum本质上都是计算历史梯度的加权平均,下文将两者统称为momentum。
\(v_0 = 0\) | \(x_0\) |
\(v_1 = g_0\) | \(x_1 = x_0 - g_0\) |
\(v_2 = \gamma g_0 + g_1\) | \(x_2 = x_0 - (1+\gamma)g_0 - g_1\) |
\(v_3 = \gamma^2 g_0 + \gamma g_1 + g_2\) | \(x_3 = x_0 - (1+\gamma+\gamma^2)g_0 - (1+\gamma)g_1 - g_2\) |
\(...\) | \(...\) |
\(v_{k+1} = \sum_{i=0}^{k} \gamma^i g_{k-i}\) | \(x_{k+1} = x_0 -\sum_{i=0}^{k} v_i\) |
动量参数\(\gamma \in (0,1)\),当前更新的步长\(x_{k+1} = x_k-v_{k+1}\)为历史梯度序列的加权组合,且时间越近的权重越大越远权重越小。
\[x_{k+1} = x_0 - g_{k} -(1+\gamma)g_{k-1} -...- (1+\gamma+...+\gamma^k)g_0\]而随着迭代次数的增加,\(x_{k+1}\)减去关于梯度\(g_0\)的权重在逐渐增加。
2. Katyusha
\[\begin{split} \quad x_{k+1}& \leftarrow \tau_1 z_k+\tau_2 \tilde{x} + (1-\tau_1-\tau_2) y_k\\ \tilde{\triangledown}_{k+1} & \leftarrow \triangledown f(\tilde{x})+\triangledown f_i(x_{k+1}) -\triangledown_i f(\tilde{x})\\ y_{k+1}&\leftarrow x_{k+1} - \frac{1}{3L}\tilde{\triangledown}_{k+1} \\ z_{k+1}&\leftarrow z_{k} - \alpha\tilde{\triangledown}_{k+1} \\ \end{split}\]当\(\tau_1 = \tau_2=0\)时,算法就变成\(SVRG\).
当\(\tau_2=0\),令\(\tau = \tau_1\),即变成了Linear Coupling(LC)中两者线性组合的形式,不同的是\(LC\)中是\(GD\)与\(MD\)的组合,而这里是\(SGD\),估计梯度\(\tilde{\triangledown}_{k+1}\)利用SVRG计算。而\(LC\)实际上也是momentum加速算法的一种形式。
2.1 Linear Coupling与Momentum
\[\quad x_{k+1} \leftarrow \tau z_k+ (1-\tau) y_k\]跟上面momentum类似,这里的\(g=\eta \tilde{\triangledown}_{k+1}\)是利用SVRG计算的,\(y_{k+1}\)是在\(x_{k+1}\)的基础上进行一步\(SGD\),\(z_{k+1}\)是在\(z_{k}\)上进行一步\(MirrorDescent\).在Linear Coupling中说过\(\tau=\frac{1}{\alpha L+1}\),那么\(\alpha = \frac{1}{\tau L} - \frac{1}{L}<\frac{1}{\tau L}\),在Katyusha中,为了计算简便取\(\alpha =\frac{1}{3\tau L}\).下面看它是如何进行动量计算的。
初始值:\(y_0 = z_0=x_0\)
\(x_1 = x_0\) | \(y_1 = x_0-\frac{1}{3L}\tilde{\triangledown}_{1}\) | \(z_1 = x_0 - \alpha \tilde{\triangledown}_{1}\) |
\(x_2 = x_0 - ( \frac{1-\tau}{3L}+\tau\alpha ) \tilde{\triangledown}_{1}\) | \(y_2 = x_0 - \frac{1}{3L}\tilde{\triangledown}_{2} -( \frac{1-\tau}{3L} +\tau\alpha) \tilde{\triangledown}_{1}\) | \(z_2 = x_0 - \alpha(\tilde{\triangledown}_{1} + \tilde{\triangledown}_{2})\) |
\(x_3 = x_0 - ((1-\tau)\frac{1}{3L} +\tau\alpha ) \tilde{\triangledown}_{2} - ((1-\tau)^2\frac{1}{3L} + (1-(1-\tau)^2)\alpha) \tilde{\triangledown}_{1}\) |
类似的,Linear Coupling每次移动的步长也是历史梯度的组合,随着迭代次数的增加,\(x_{k+1}\)关于\(\tilde{\triangledown}_{1}\)的权重也是逐渐增加的。
\[\begin{split} x_{k+1} &= x_0-\sum_{i=1}^{k} ((1-\tau)^{k-i}\frac{1}{3L} +(1-(1-\tau)^{k-i})\alpha ) \tilde{\triangledown}_{i} 【LC】 \\ x_{k+1} &= x_0 - \sum_{i=0}^{k} (\gamma^0+...+\gamma^{k-i})g_i 【Momentum】 \end{split}\]两者在某种程度上可以说是等价的。其中\(\alpha =\frac{1}{3\tau L}\),当我们令\(g_i = \frac{1}{3L} \tilde{\triangledown}_{i}\),那么求和括号内项:
\[(1-\tau)^{k-i} +(1-(1-\tau)^{k-i})\frac{1}{\tau} \in (1,\frac{1}{\tau}) \quad (1)\] \[\gamma^0+...+\gamma^{k-i} = \frac{1-\gamma^n}{1-\gamma}\propto \frac{1}{1-\gamma} \in (1,+\infty)\](1)是根据\(\lambda a+ (1-\lambda)b \in (a,b)\),而\(\frac{1}{\tau}\in (1,+\infty)\)。可以看到实际上Linear Coupling也是一种momentum加速算法的一种形式。而作者发现把\(LC\)中的\(GD\)直接换成\(SGD\)时效果并不好,就像直接将Nesterov’s momentum加在\(SGD\)上时一样加速效果并不好。这是因为\(SGD\)梯度含有一定的噪声,而momentum是会记住历史梯度,它会让\(SGD\)继续往错误的方向加速前进。因此Zeyuan为\(SGD\)重新设计一个加速算法\(Katyusha\).
2.2 Katyusha Momentum
Katyusha momentum在上述2.1节momentum中添加了新的一项\(\tau_2\tilde{x}\).论文中作者取\(\tau_2=0.5\),
\[\quad x_{k+1} \leftarrow \tau z_k+0.5 \tilde{x} + (0.5-\tau)y_k\] \[x_{k+1} = x_0-\sum_{i=1}^{k} ((0.5-\tau)^{k-i}\frac{1}{3L} +(0.5-(0.5-\tau)^{k-i})\alpha ) \tilde{\triangledown}_{i}\]\(\tau_2\tilde{x} = x_0\),它依旧是\(z_k\)与\(y_k\)两者的线性耦合,只不过步长变成了原来的一半,\(\because \lambda_1+\lambda_1=k\),\(\lambda_1 a+\lambda_2 b \in (ka,kb)\),所以\(步长\in (\frac{1}{3L}\tilde{\triangledown}_{i} ,\alpha \tilde{\triangledown}_{i} )\Rightarrow (\frac{0.5}{3L}\tilde{\triangledown}_{i} ,0.5\alpha \tilde{\triangledown}_{i} )\)。即每执行一步momentum都往后向\(x_0\)的方向退一步,因此该步骤也叫做负动量Negative momentum.
如图,黑线(\(SGD\))为在\(x_{k}\)点走了一步负梯度方向得到\(y_k\),绿色线(momentum)为\(y_k\)与\(z_k\)组合后的\(x_2'\)即\(\tau_2=0\)的momentum加速,橙色线(Katyusha Momentum)为\(x_2'\)与起点\(x_0\)连线中点。
\[\begin{split} 黑SGD:&y_1 \leftarrow x_1 - \frac{1}{3L} \tilde{\triangledown}_{1} \\ 绿Momentum:&x_2' \leftarrow \tau y_1 + (1-\tau)z_1 = x_0-((1-\tau)\frac{1}{3L} +\tau\alpha ) \tilde{\triangledown}_{1} \\ 橙Katyusha Momentum:&x_2 \leftarrow x_0-\frac{1}{2}((1-\tau)\frac{1}{3L} +\tau\alpha ) \tilde{\triangledown}_{1} \end{split}\]在执行完\(m\)次内循环后,起始点\(\tilde{x}\)(磁铁)从该循环的终点开始继续下一轮的momentum更新。
2.3 收敛速率
\(SGD\)的发展主要分为三个阶段:传统\(SGD\)、方差缩减方法以及加速\(SGD\)。
\(SGD\)只需要计算单个样本比\(GD\)计算量小\(n\)倍,但其收敛速率仅为\(\frac{1}{\epsilon}\)尽管目标函数是强凸的。因此第二个阶段中以\(SVRG\)为代表的方差缩减方法将\(SGD\)的收敛速度提升到了\(O((n+\frac{L}{\sigma})\log\frac{1}{\epsilon})\),而它依赖于\(\kappa=\frac{L}{\sigma}\)我们称为“条件数量”,那么是否可以设计一个更快的算法达到\(\sqrt\kappa\)的依赖速度?\(APPA\)和\(Catalyst\)做到了也迎来了\(SGD\)的加速阶段,收敛速度提升到了\(O((n+\sqrt{n\kappa })\log\kappa \log\frac{1}{\epsilon})\),但是\(Catalyst\)依旧不够完美,\(Katyusha\)消去了\(\log\kappa\)达到了最优收敛速率\(O((n+\sqrt{n\kappa }) \log\frac{1}{\epsilon})\)。
Variance Reduction | Acceleration(momentum) | |
---|---|---|
\(SGD\) | \(SAG\) \(SVRG\) \(SAGA\) | \(APPA\) \(Catalyst\) \(Katyusha\) |
强凸光滑 | 强凸不光滑 | 非强凸光滑 | 非强凸不光滑 | |
---|---|---|---|---|
\(SGD\) | \(O(\frac{L^2}{\sigma \epsilon})\) | \(O(\frac{G}{\sigma \epsilon})\) | \(O(\frac{L^2}{\epsilon^2})\) | \(O(\frac{G}{ \epsilon^2})\) |
\(SVRG\) | \(O((n+\kappa )\log\frac{1}{\epsilon})\) | \(O(n\log\frac{1}{\epsilon}+\frac{G}{\sigma\epsilon})\) | \(O(n\log\frac{1}{\epsilon}+\frac{L}{\epsilon})\) | |
\(Katyusha\) | \(O((n+\sqrt{n\kappa })\log\frac{1}{\epsilon})\) | \(O(n\log\frac{1}{\epsilon}+ \sqrt{n\frac{L}{\epsilon})}\) |
Enjoy Reading This Article?
Here are some more articles you might like to read next: