Published on

Learning word embeddings efficiently with noise-contrastive estimation - Mnih 2013

Paper Takeaway

This write-up contains my first impressions of the paper : Learning word embeddings efficiently with noise-contrastive estimation - (Mnih 2013)

Paper Background

Calculating word embeddings can be very computationally expensive, since 'true distributions' over the vocabulary take O(V)) time, unless a tree embedding is used, in which case O(log(V)) is possible.

Instead of going for the full distribution, however, a number of simplifying steps can be made :

  • Only ratios are required, so the (expensive) normalizing constants can be discarded

  • Full distributions can be approximated by taking samples : And this works really well.

Surprising stuff

  • The major surprise here is that the method works so well

  • Particularly since the 'sophisticated structure' approaches (such as trees) were thrown out in favour of just processing larger quantities of data more efficiently

  • The NCE idea had already been explored in a slightly different context in : Quick training of probabilistic neural nets by importance sampling - (Bengio and Senécal 2003).

  • Although the NCE is claimed to be constant time, there is a twist. The simple act of picking k weighted noise samples (each from a vocabulary of size V) could easily be an O(log(V)) operation. One way around this is to pick words sequentially (or with a stride, etc), from the corpus itself (since that would have the right statistics, by construction).

  • Experimental detail : Thinned out words by rejecting any with <10 occurrences

Ideas to Follow-Up

Multiple representations for different senses of words.

The basis of this work is analysed and recostructed within a more general setting in the GloVe paper : Essentially the same thing can computed more generally, faster, and with OSS code

Word embedding task
Simplified stochastically
Reveals true essence.