Scalable Hierarchical Distributed Language Model - Mnih and Hinton 2009

This write-up contains my first impressions of the paper : A Scalable Hierarchical Distributed Language Model - (Mnih and Hinton 2009).

Paper Background

The main aim of the paper was to reduce the output stage of a next-word prediction task from an O(V) representation (where V is the vocabulary size) to a tree-based representation that is O(log(V)).

This tree creation algorithm was a two-stage method : stage 1 involved operating on a random tree, so that some vector embedding could be performed, then stage 2 produced a tree based on the word vectors found in stage 1, and then produced the final word embeddings from there.

Surprising stuff

The paper reported that if words were allowed to appear multiple times in the tree, the training didn't seem to pick up multiple senses : The duplicate entries favoured rare words, in similar settings, rather than common, multi-sense words
The issues addressed in building trees (eg: how to reasonably divide up the samples at the node between left and right children) were those common in the tree recursive methods (e.g. CART) community - and it seemed like a less-than-in-depth treatment of an area that is actively researched
Unlike some other papers, this one included generous attribution of other people's ideas, and a clear exposition of the reasoning behind each choice at each decision point
Hierarchical methods were plainly a computational win over the 'full-V' LBL model. However, it seemed like the authors had to 'pull out all the stops' to get perplexity performance to beat the complexity-competitive KN5 model
Also interesting, for using Restricted Boltzman Machines (though not explained here why they've "moved on", nor even mentioned, but to refer to previous benchmark results) : Three new graphical models for statistical language modelling - (Mnih and Hinton 2007)

Ideas to Follow-Up

The paper Learning word embeddings efficiently with noise-contrastive estimation - (Mnih 2013) seems to supercede this approach (the tree-element of which is also taken up in Efficient Estimation of Word Representations in Vector Space - (Mikolov et al, 2013)). Will look at the former next.