Batch Normalization - Accelerating Deep Network Training by Reducing Internal Covariate Shift

This write-up contains a few takeaways from the recent paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015).

Whitening and Factorization

The whitening (per minibatch, and in practice a simplified version that standardizes each feature independently) plus rescaling (via learned scale and shift parameters) is a neat new idea. But, as hinted at by their p5 comment about the bias term being subsumed, it also suggests that the (Wu+b) transformation probably has a better-for-learning 'factorization': their un-scale/re-scale operation on (Wu+b) mainly takes out such a factor, while also introducing a dependence on the minibatch statistics.
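To make the 'bias is subsumed' point concrete, here is a minimal NumPy sketch (mine, not the paper's code). It assumes the training-time form of the transform: standardize each feature over the minibatch, then apply a learned scale `gamma` and shift `beta`. The function name `batch_norm`, the shapes, and the `eps` value are all illustrative choices, and inputs are stored as rows, so (Wu+b) appears as `u @ W + b`.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Standardize each feature (column) with this minibatch's own statistics,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 4))   # minibatch of 32 inputs with 4 features each
W = rng.normal(size=(4, 3))    # weights feeding 3 units
b = rng.normal(size=3)         # bias, which the mean subtraction cancels
gamma = np.ones(3)             # illustrative initial values for the
beta = np.zeros(3)             # learned scale and shift

with_bias = batch_norm(u @ W + b, gamma, beta)
without_bias = batch_norm(u @ W, gamma, beta)

# The bias shifts every row by the same amount, so subtracting the
# minibatch mean removes it; beta is the only shift that survives.
bias_is_subsumed = np.allclose(with_bias, without_bias)

# A positive rescaling of the pre-activation is likewise divided back out
# (only approximately, because of eps); gamma is the only scale that survives.
scale_is_factored_out = np.allclose(
    batch_norm(3.0 * (u @ W), gamma, beta), without_bias, atol=1e-3
)
```

In this sketch both flags come out true, which is one way to read the 'factor' being taken out: the shift and overall scale of (Wu+b) are stripped away and re-introduced as the separate parameters `beta` and `gamma`.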

Replacing Dropout?

The idea that this could replace Dropout as the go-to training trick is pretty worrying (IMHO), since the gains from Dropout seem to come from a 'meta network' direction (in effect, training many thinned sub-networks at once), rather than from a data-dependency direction (normalizing with minibatch statistics).

Both approaches seem well worth understanding more thoroughly, even though the 'do what works' ML approach might favour leaving Dropout behind.

Publication timing

The publication of this paper, so close behind the new ReLU+ results from Microsoft, seems too coincidental. One has to wonder what other results each of the companies has in its back pocket, ready to be pulled out so that they can repeatedly steal the crown from each other.

Results for Large Models

For me, the application to MNIST is attention-grabbing enough.

While I appreciate that playing with Inception (etc.) sexes up the paper a lot, it raises the hurdle for others who may not have that quantity of hardware, but who could still contribute to the (more interesting) project of making learning faster for all projects.

For instance, it is quite possible to work on the mechanisms of learning using the MNIST dataset, except that MNIST is pretty much 'solved' (see (a) below), with the remaining error cases being pretty questionable for humans too.

(a) According to "Efficient batchwise dropout training using submatrices", a fully connected network with two hidden layers of 80 units each can learn to classify the MNIST training set perfectly in about 20 training epochs; unfortunately, the test error is quite high, at about 2%. Increasing the number of hidden units by a factor of 10 and using dropout results in a lower test error of about 1.1%.