Learning without Labels

Avoiding local minima during childhood

One reason for training deep networks for extended periods without labels is that pure backpropagation training suffers from dilution of the gradient information with each step backward through the network. This is why we add a training phase with no notion of target labels: it lets the network become familiar with the input data and form some early abstractions, which are then easier to hone during a pure backpropagation phase.
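
A minimal sketch of that two-phase scheme, assuming PyTorch and a toy autoencoder on stand-in data (the layer sizes, optimiser and step counts are purely illustrative, not taken from any real experiment):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in data: 256 'images' of 64 pixels, 10 classes (purely illustrative).
x = torch.randn(256, 64)
y = torch.randint(0, 10, (256,))

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
decoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 64))
classifier = nn.Linear(16, 10)

# Phase 1: label-free. Learn to reconstruct the input, so the encoder
# picks up some early abstractions of the data before any labels are seen.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    nn.functional.mse_loss(decoder(encoder(x)), x).backward()
    opt.step()

# Phase 2: pure backpropagation on labels, honing the pretrained features.
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(classifier(encoder(x)), y).backward()
    opt.step()
```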

Local minima

Another way of looking at this is to consider the sources of noise in the system.

In a visual recognition task, there is a great deal of redundant and/or confounding information both in the input pixel data and in the initial (random) weights of the neurons in the network. Getting the system into a usable initial condition for reliable backpropagation training is key to avoiding getting caught in local minima (where one is learning more about the noise than about the signal).
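
To make the gradient-dilution point a little more concrete (a small sketch, again assuming PyTorch, with purely illustrative sizes): push a batch through a deep stack of randomly initialised sigmoid layers and look at how much gradient actually reaches the layers nearest the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(12)])

h = torch.randn(32, 64)
for layer in layers:
    h = torch.sigmoid(layer(h))   # deep, randomly initialised stack

# A stand-in loss; only the gradient flow matters here.
h.sum().backward()

# Gradient magnitude per layer: typically far smaller near the input,
# so the early layers see a heavily diluted training signal.
for i, layer in enumerate(layers):
    print(f"layer {i:2d}  |grad| = {layer.weight.grad.norm().item():.2e}")
```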

Childhood labels

Rough progression visually:

  • Static vs. moving

  • Stuff vs. non-stuff (interesting vs. boring)

  • Face vs. non-face

  • Mummy vs. other people

  • Black-and-White vs. colours (which become interesting later)

Later on (skipping many stages):

  • Peek-a-boo

  • Names for parts of the head

  • Big vs. little - comparing objects

  • Mine vs. not-mine

  • Lining things up in rows

  • Singular vs. plural

These developmental stages seem pretty much built in, each becoming 'interesting' at the time when the brain is ready to appreciate it. Of course, it may be that only a brain prepared by the previous stage is capable of learning the next.

Simple story

Despite parents' best efforts, most of the labels (or preferences about what aspect of the world is being learned) seem to be internally generated by infants. The brain seems to have a lesson scheme built in, mapping out the right order in which to absorb different lessons.

As a simple example, "The Very Hungry Caterpillar" offers very different lessons to children of different ages:

  • a book is not for chewing

  • looking at pages means more time with a parent

  • turning pages

  • turning pages when it's the right time

  • poking fingers through the holes

  • realising that books have a 'right way' up

  • understanding that the caterpillar is the same on each page

  • seeing the different fruit

  • understanding that the caterpillar is eating

  • seeing that fingers come from one page to the next

  • understanding that there's a progression in stuff being eaten

  • realising that the words are the same each time

  • understanding that the caterpillar is getting bigger

  • being able to fill in missing words

  • understanding that the caterpillar sleeps after eating

  • ...

  • understanding that the caterpillar becomes a butterfly (this is an enormous jump)

Label-less learning

Lots of development occurs without explicit labels. The brain has a built-in knack for knowing when the right time has arrived to chunk up data in ways that will be helpful later.

Building up a network's weights gradually

  • solving the bulk problem first, followed by refinements, seems only natural (see the sketch below). This may also ensure that the network doesn't get 'prematurely optimized' and cornered into a local minimum that is detrimental to further learning progress.
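
As a rough sketch of 'bulk first, refinements later' (assuming PyTorch; the blurring scheme, stand-in data and step counts are illustrative rather than a prescribed curriculum), the same network can be trained first on heavily smoothed inputs and then on the full-detail versions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 1, 16, 16)        # stand-in 'images'
y = torch.randint(0, 10, (256,))
net = nn.Sequential(nn.Flatten(), nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def train(inputs, steps):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(net(inputs), y).backward()
        opt.step()

# Stage 1: the 'bulk' problem - heavily blurred inputs keep only coarse structure.
coarse = nn.functional.avg_pool2d(x, 4)
coarse = nn.functional.interpolate(coarse, scale_factor=4)
train(coarse, steps=200)

# Stage 2: refinement - the same network now sees the full-detail inputs,
# starting from weights that already capture the coarse structure.
train(x, steps=200)
```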

An interesting avenue of inquiry is whether it's possible to find criteria that indicate when a given network is ready to 'move on' to the next stage of difficulty in learning. Can one detect when a network has started to over-learn, and use that signal to introduce the 'next level of difficulty' of input data to prevent this happening?
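
One possible shape for such a criterion, sketched here with PyTorch and a made-up helper (train_with_staged_difficulty is hypothetical, as are the data, patience and thresholds): watch a held-out validation loss, and treat a stretch with no improvement as the signal to introduce the next level of difficulty.

```python
import torch
import torch.nn as nn

def train_with_staged_difficulty(net, stages, val_data, patience=5,
                                 steps_per_check=50, max_checks=40):
    """Hypothetical helper: advance to the next (harder) stage once the
    validation loss stops improving - a crude proxy for the onset of
    over-learning. `stages` is a list of (inputs, labels) pairs ordered
    from easy to hard; the criterion is an assumption, not an established recipe."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    val_x, val_y = val_data
    for stage, (x, y) in enumerate(stages):
        best, stale = float("inf"), 0
        for _ in range(max_checks):
            for _ in range(steps_per_check):
                opt.zero_grad()
                nn.functional.cross_entropy(net(x), y).backward()
                opt.step()
            with torch.no_grad():
                val_loss = nn.functional.cross_entropy(net(val_x), val_y).item()
            if val_loss < best - 1e-4:
                best, stale = val_loss, 0     # still improving on held-out data
            else:
                stale += 1                    # no progress: over-learning suspected
                if stale >= patience:
                    break
        print(f"stage {stage}: moving on at validation loss {best:.3f}")

# Illustrative call: two made-up stages, the 'hard' one noisier than the 'easy' one.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))
easy = (torch.randn(200, 32), torch.randint(0, 5, (200,)))
hard = (torch.randn(200, 32) * 3.0, torch.randint(0, 5, (200,)))
val = (torch.randn(100, 32), torch.randint(0, 5, (100,)))
train_with_staged_difficulty(net, [easy, hard], val)
```

The patience and improvement tolerance here are just knobs; the point is only that the 'move on' decision is driven by a measurable plateau rather than a fixed schedule.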