Microsoft Research Asia (MSRA) recently blew everyone away with their results on the ImageNet and COCO datasets. If you haven’t yet seen the work check out the paper here and especially some of the examples in the presentation here (Note: I’ve borrowed some of their figures below). Computer vision has come a long way, color me impressed!

Their basic claim is that a deeper network should in principle be able to learn anything a shallower network can learn: if the additional layers simply performed an identity mapping, the deeper network would be functionally identical to a shallower network. However, they show empirically that increases in network depth beyond a certain point make deeper networks perform worse than shallower networks. Despite modern optimization techniques like batch normalization and Adam, it seems ultra-deep networks with a standard architecture are fundamentally hard to train with gradient methods.It would seem that for standard architectures and training methods, we’ve passed the point of diminishing returns and started to regress. The theoretical benefits of increased depth, will never be realized unless we do something differently. We must go deeper!

The MSRA paper makes a simple proposal based on their insight that the additional layers in deeper networks need only perform identity transforms to perform as well as shallower networks. Because the deeper networks seem to have a hard time discovering the identity transform on their own, they simply build the identity transform in! The layers in the neural network now learn a “residual” function F(x) to add to the identity transform. To perform an identity transform only, the network only needs to force the weights in F(x) to zero. The basic building block of their Deep Residual Learning network is:

A similar line of reasoning also led to the recently proposed Highway networks. The main difference is that Highway networks have an additional set of weights that control the switching between, or mixing, of x and F(x).

My first reaction to the residual learning framework, was “that’s an interesting hack, I’m amazed it seems to work as well as it does”. Now don’t get me wrong, I love a simple and useful hack (*cough* dropout *cough* batch normalization *cough*) as much as the next neural net aficionado. But on further consideration, it occurred to me there is an interesting way to look at what is going on in the residual learning blocks in terms of theoretical neuroscience.

Below is a cartoon model of some of the basic computations believed to take place within a cortical area of the brain. (A couple examples of research in this area can be found here and here.) The responses of pyramidal cells, the main output cells of cortex (shown as triangles), are determined by their input as well as modulations due locally recurrent interactions with inhibitory cells (shown as circles) and each other. I hope the analogy I’m making to the components of the residual learning block is made clear by the color coding. Basically the initial activation, shown in red and indicated by x, is due to the input. This initial activation triggers the recurrent connections which are a nonlinear function of the initial activation, shown in blue and represented by F(x). The final output, shown in purple, is simply the sum of the input driven activity, x, and the recurrently driven activity, F(x).The residual learning blocks can thus be thought of as implementing locally recurrent processing, perhaps analogously to how it happens in the brain! The input to a brain area/residual learning block is processed recurrently before being passed to the next brain area/residual learning block. Obviously, the usual caveats apply: the processing in the brain is dynamic in time and is much more complicated and nonlinear, there is no account of feedback here, etc. However, I think this analogy might be a useful, and biologically plausible, way to understand the success of deep residual learning.