This write-up contains my first impressions of the paper: Learning word embeddings efficiently with noise-contrastive estimation (Mnih & Kavukcuoglu, 2013)
Calculating word embeddings can be very computationally expensive, since normalizing 'true distributions' over the vocabulary takes O(V) time, unless a tree embedding is used, in which case O(log(V)) is possible.
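To make the O(V) cost concrete, here is a minimal sketch (with made-up vocabulary size and embedding dimension) of why the full softmax is expensive: the normalizer has to touch the score of every word in the vocabulary for each training example.

```python
import numpy as np

V, d = 10_000, 50                 # hypothetical vocabulary size and embedding dim
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))       # output word embeddings
h = rng.normal(size=d)            # context representation

# Full softmax: the normalizer sums over every row of W -> O(V) per example.
scores = W @ h
p = np.exp(scores - scores.max())
p /= p.sum()

assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

A tree-structured (hierarchical) softmax avoids this by replacing the V-way normalization with O(log(V)) binary decisions along a path from root to word.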
Instead of going for the full distribution, however, a number of simplifying steps can be made:
Only ratios are required, so the (expensive) normalizing constants can be discarded.
Full distributions can be approximated by taking samples, and this works really well.
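The two simplifications above combine in the NCE objective: the model's scores are left unnormalized, and the word classifier is trained to distinguish one data word from k sampled noise words. This is a hedged sketch of that loss for a single context, assuming a unigram noise distribution q and toy dimensions (not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, k = 1000, 32, 5                      # toy vocab size, embedding dim, noise samples
W = rng.normal(scale=0.1, size=(V, d))     # word embeddings
h = rng.normal(size=d)                     # context representation

q = rng.random(V)
q /= q.sum()                               # stand-in unigram noise distribution

def score(w):
    # Unnormalized model score s(w, h); the normalizer is deliberately
    # dropped (treated as a constant), which is the first simplification.
    return W[w] @ h

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(target, noise_words):
    # P(word came from data) = sigma(s(w, h) - log(k * q(w)));
    # push it up for the data word, down for the k noise words.
    loss = -np.log(sigma(score(target) - np.log(k * q[target])))
    for w in noise_words:
        loss -= np.log(1.0 - sigma(score(w) - np.log(k * q[w])))
    return loss

target = 42
noise = rng.choice(V, size=k, p=q)         # second simplification: sample, don't sum
loss = nce_loss(target, noise)
assert np.isfinite(loss) and loss > 0
```

Per example this costs O(k) score evaluations instead of O(V), independent of vocabulary size.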
The major surprise here is that the method works so well, particularly since the 'sophisticated structure' approaches (such as trees) were thrown out in favour of just processing larger quantities of data more efficiently.
A closely related idea (approximating the softmax by sampling) had already been explored in a slightly different context in: Quick training of probabilistic neural nets by importance sampling (Bengio & Senécal, 2003).
Although NCE is claimed to be constant time, there is a twist. The simple act of picking k weighted noise samples (each from a vocabulary of size V) could easily be an O(log(V)) operation. One way around this is to pick words sequentially (or with a stride, etc.) from the corpus itself, since that stream has the right statistics by construction.
Experimental detail: the vocabulary was thinned out by rejecting any word with fewer than 10 occurrences.
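That thinning step is a one-liner; a minimal sketch on a made-up corpus (the threshold of 10 is from the paper, the data is not):

```python
from collections import Counter

corpus = ["apple"] * 12 + ["banana"] * 3 + ["cherry"] * 15

counts = Counter(corpus)
# Keep only words with at least 10 occurrences.
vocab = {w for w, c in counts.items() if c >= 10}

assert vocab == {"apple", "cherry"}
```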
Ideas to Follow-Up
Multiple representations for different senses of words.
The basis of this work is analysed and reconstructed within a more general setting in the GloVe paper: essentially the same thing can be computed more generally, faster, and with OSS code.
Whether the word embedding task itself reveals the 'true essence' of words.