One reason for training deep networks for extended periods without labels is that pure backpropagation training suffers from the dilution of gradient information with each step backward through the network. Adding a training phase with no notion of target labels lets the network become familiar with the input data and form some early abstractions, which are then easier to hone during a pure backpropagation phase.
Another way of looking at this is to consider the sources of noise in the system.
In a visual recognition task, there's a great deal of redundant and/or confounding information in the input pixel data, and further noise in the initial (random) weights of the neurons in the network. Getting the system into a usable initial condition for reliable backpropagation training is key to avoiding getting caught in local minima (where one is learning more about the noise than about the signal).
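To make the 'dilution of gradient information' concrete, here is a minimal sketch. It rests on assumptions that are not in the text above: sigmoid activations, a chain of single-unit layers, and every weight set to 1.0. The function names are purely illustrative. Under those assumptions, each step backward multiplies the gradient by at most 0.25, so the signal reaching the early layers shrinks geometrically.

```python
import math


def sigmoid(x):
    """Logistic activation (assumed here; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_gradient_scales(pre_activations, weights):
    """Return the cumulative factor applied to a gradient as it flows
    backward through a chain of single-unit sigmoid layers.

    Entry k of the result is the magnitude of the factor after passing
    back through k + 1 layers, starting from the output end.
    """
    scale = 1.0
    scales = []
    for z, w in zip(reversed(pre_activations), reversed(weights)):
        scale *= w * sigmoid_grad(z)  # chain rule for one layer
        scales.append(abs(scale))
    return scales


# Best case for the sigmoid: every unit sits at its steepest point (z = 0).
depth = 10
scales = backprop_gradient_scales([0.0] * depth, [1.0] * depth)
for layers_back, s in enumerate(scales, start=1):
    print(f"{layers_back:2d} layers back: gradient scale {s:.2e}")
```

Even in this best case the factor after ten layers is 0.25 to the tenth power, a little under one millionth, which is one way of seeing why the earliest layers benefit from being put into a sensible state before backpropagation starts.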
Rough progression visually:
Static vs. moving
Stuff vs. non-stuff (interesting vs. boring)
Face vs. non-face
Mummy vs. other people
Black-and-White vs. colours (which become interesting later)
Later on (skipping many stages):
Names for parts of the head
Big vs. little - comparing objects
Mine vs. not-mine
Lining things up in rows
Singular vs. plural
Each of these developmental stages seems to be built in as 'interesting' at just the time when the brain is ready to appreciate it. Of course, it may be that only a brain prepared by the previous stage is capable of learning the next.
Despite parents' best efforts, most of the labels (or preferences about what aspect of the world is being learned) seem to be internally generated by infants. The brain seems to have a lesson scheme built in - mapping out the right order in which to absorb different lessons.
As a simple example, "The Very Hungry Caterpillar" offers very different lessons to children of different ages:
a book is not for chewing
looking at pages means more time with parent
turning of pages
turning of pages when it's the right time
poking fingers through the holes
realising that books have a 'right way' up
understanding that the caterpillar is the same on each page
seeing the different fruit
understanding that the caterpillar is eating
seeing that fingers come from one page to the next
understanding that there's a progression in stuff being eaten
realising that the words are the same each time
understanding that the caterpillar is getting bigger
being able to fill in missing words
understanding that the caterpillar sleeps after eating
understanding that the caterpillar becomes a butterfly (this is an enormous jump)
Lots of development occurs without explicit labels. The brain has a built-in knack for knowing when the right time has arrived to chunk up data in ways that will be helpful later.
Building up a network's weights gradually - solving the bulk problem first, followed by refinements - seems only natural. This may also ensure that the network doesn't get 'prematurely optimized' and cornered in a local minimum that is detrimental to further learning progress.
An interesting avenue of inquiry is whether it's possible to find criteria that indicate when a given network is ready to 'move on' to the next stage of difficulty in learning. Can one detect when a network has started to over-learn, and use that signal to introduce the 'next level of difficulty' of input data to prevent this happening?
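One candidate criterion, offered only as a sketch, borrows the usual early-stopping signal: if the loss on held-out data stops improving for several epochs in a row, treat that as the start of over-learning and move on to harder data instead of stopping. The class name `CurriculumScheduler` and its `patience` and `min_delta` parameters are hypothetical, and whether a validation-loss plateau is the right readiness signal is exactly the open question posed above.

```python
class CurriculumScheduler:
    """Advance to the next difficulty stage when validation loss plateaus.

    Hypothetical helper: a plateau in held-out loss is used here as a
    stand-in for 'the network has started to over-learn this stage'.
    """

    def __init__(self, num_stages, patience=3, min_delta=0.0):
        self.num_stages = num_stages  # how many difficulty levels exist
        self.patience = patience      # epochs without improvement tolerated
        self.min_delta = min_delta    # smallest drop that counts as improvement
        self.stage = 0                # current difficulty level (0 = easiest)
        self._best_val = float("inf")
        self._stale_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss.

        Returns True if this call moved the curriculum to the next stage.
        """
        if val_loss < self._best_val - self.min_delta:
            # Still improving on the current stage: keep going.
            self._best_val = val_loss
            self._stale_epochs = 0
            return False

        self._stale_epochs += 1
        on_last_stage = self.stage >= self.num_stages - 1
        if self._stale_epochs >= self.patience and not on_last_stage:
            # Plateau detected: introduce the next level of difficulty
            # and start tracking improvement afresh.
            self.stage += 1
            self._best_val = float("inf")
            self._stale_epochs = 0
            return True
        return False


# Made-up validation losses, purely for illustration.
scheduler = CurriculumScheduler(num_stages=3, patience=3)
for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.9, 0.6]):
    if scheduler.update(loss):
        print(f"epoch {epoch}: moving on to stage {scheduler.stage}")
```

In a real training loop the current `stage` would select which subset of the data is fed to the network. Resetting the best loss on each advance matters, because losses on harder data are not comparable with those from the easier stage.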