{1543} revision 1 modified: 07-21-2021 03:07 gmt

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

  • This paper is sort-of interesting: rather than back-propagating the errors, you optimize auxiliary variables, pre-nonlinearity 'codes', in last-to-first layer order. The optimization minimizes a multinomial logistic loss; the math is not worked out for other loss functions, but presumably this is not a fundamental limitation. The loss function also includes a quadratic term on the weights.
  • After the 'codes' are set, optimization of the weights can proceed in parallel. This is done with either plain SGD or Adam.
  • Weight L2 penalty is scheduled over time.
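
The bullets above can be sketched on a toy one-hidden-layer network. This is only a rough illustration, not the paper's algorithm: it swaps in a squared output loss for the multinomial logistic loss, uses a closed-form ridge solve for the weights where the paper uses SGD or Adam, and keeps the L2 penalty fixed rather than scheduled. All names (`am_fit`, `update_codes`, `ridge_fit`, `mu`, `lam`) are made up for this note.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def objective(W1, W2, C1, X, Y, mu, lam):
    """Squared output loss + quadratic code coupling + L2 weight penalty."""
    out_err = Y - W2 @ relu(C1)
    code_err = C1 - W1 @ X
    return (0.5 * np.sum(out_err ** 2)
            + 0.5 * mu * np.sum(code_err ** 2)
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2)))

def update_codes(W1, W2, C1, X, Y, mu, lam, step=1.0):
    """One gradient step on the codes, weights held fixed.
    The step is halved until the objective does not increase."""
    out_err = W2 @ relu(C1) - Y
    grad = (W2.T @ out_err) * (C1 > 0) + mu * (C1 - W1 @ X)
    base = objective(W1, W2, C1, X, Y, mu, lam)
    for _ in range(30):
        cand = C1 - step * grad
        if objective(W1, W2, cand, X, Y, mu, lam) <= base:
            return cand
        step *= 0.5
    return C1  # no acceptable step found; keep the old codes

def ridge_fit(targets, inputs, lam):
    """Minimiser of 0.5*||targets - W inputs||^2 + 0.5*lam*||W||^2."""
    d = inputs.shape[0]
    gram = inputs @ inputs.T + lam * np.eye(d)
    return np.linalg.solve(gram, inputs @ targets.T).T

def am_fit(X, Y, hidden=8, mu=1.0, lam=1e-2, iters=20, seed=0):
    """Alternate between a code update and per-layer weight updates.
    Columns of X and Y are samples."""
    rng = np.random.default_rng(seed)
    W1 = 0.1 * rng.standard_normal((hidden, X.shape[0]))
    W2 = 0.1 * rng.standard_normal((Y.shape[0], hidden))
    C1 = W1 @ X  # initialise the codes with a forward pass
    history = [objective(W1, W2, C1, X, Y, mu, lam)]
    for _ in range(iters):
        C1 = update_codes(W1, W2, C1, X, Y, mu, lam)
        # With the codes fixed, the two weight problems are independent,
        # so these two solves could run in parallel.
        W1 = ridge_fit(C1, X, lam / mu)
        W2 = ridge_fit(Y, relu(C1), lam)
        history.append(objective(W1, W2, C1, X, Y, mu, lam))
    return W1, W2, history
```

Once `C1` is fixed, `W1` only has to map inputs to codes and `W2` only has to map activated codes to targets, which is where the per-layer parallelism comes from. The error signal still travels backward, but through the code update rather than through a chain-rule pass.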

This is interesting in that the weight updates can be done in parallel, which is perhaps more efficient, but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to auto-diff + backprop, I can't see this being adopted broadly.

That said, the idea of alternating minimization (which is used, e.g., for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convergence of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.
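
A guess at why covariance-style memory helps in the minibatch setting: if a layer's weight subproblem is quadratic, running sums of input-input and code-input outer products are sufficient statistics, so the weights can be solved for as if all past minibatches were still in hand. The sketch below assumes a plain ridge objective per layer; the class `LayerMemory` and its methods are hypothetical, and the memory matrices in the paper may well be defined differently (e.g. with decay).

```python
import numpy as np

class LayerMemory:
    """Running second-moment statistics for one linear layer (hypothetical helper)."""

    def __init__(self, n_in, n_out):
        self.A = np.zeros((n_in, n_in))   # sum of a a^T over all inputs seen
        self.B = np.zeros((n_out, n_in))  # sum of c a^T (codes times inputs)

    def update(self, acts, codes):
        """Fold one minibatch into the memory; columns are samples."""
        self.A += acts @ acts.T
        self.B += codes @ acts.T

    def solve(self, lam=1e-3):
        """Weights minimising the accumulated
        0.5*||codes - W acts||^2 + 0.5*lam*||W||^2."""
        n_in = self.A.shape[0]
        return np.linalg.solve(self.A + lam * np.eye(n_in), self.B.T).T


def full_batch_ridge(acts, codes, lam=1e-3):
    """Same objective solved with all the data at once, for comparison."""
    n_in = acts.shape[0]
    gram = acts @ acts.T + lam * np.eye(n_in)
    return np.linalg.solve(gram, acts @ codes.T).T
```

Streaming minibatches through `update` and then calling `solve` should give the same weights as `full_batch_ridge` on the concatenated data, up to floating-point error, without storing any past samples.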