I was amazed when word2vec first came out, and it fascinated me that such a powerful language model could be built so easily. I wanted a deeper understanding of the math behind it because I have some ideas for potential changes, so I decided to implement the original paper. My implementation in PyTorch is here.
word2vec is actually 2 papers:
One of the papers is titled “Distributed Representations of Words and Phrases and their Compositionality”. Their form of compositionality is interesting, but it cannot express the kinds of compositionality discussed in typical research, such as systematicity and productivity. Still, I wonder whether word2vec could be modified to be more compositional.
Hierarchical softmax is not used, even though it gets a whole section in the paper; the authors use negative sampling instead.
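Negative sampling turns the expensive full-vocabulary softmax into a handful of binary classifications: raise the score of the true (target, context) pair and lower the scores of a few randomly sampled "noise" words. A minimal sketch of the loss, with made-up tensor shapes for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: one (target, context) pair per row plus k negatives.
embed_dim, k = 64, 5
target = torch.randn(8, embed_dim)          # target word vectors
context = torch.randn(8, embed_dim)         # true context word vectors
negatives = torch.randn(8, k, embed_dim)    # k sampled "noise" word vectors each

# Push the true pair's dot product up...
pos_loss = F.logsigmoid((target * context).sum(-1))            # (8,)
# ...and the noise pairs' dot products down.
neg_scores = (negatives @ target.unsqueeze(-1)).squeeze(-1)    # (8, k)
neg_loss = F.logsigmoid(-neg_scores).sum(-1)                   # (8,)

loss = -(pos_loss + neg_loss).mean()
```

The gradient only touches the target word, the context word, and the k negatives, so each update costs O(k) instead of O(vocabulary).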
The word2vec papers train two models: CBOW and skip-gram.
CBOW stands for continuous bag of words: the context is given as an unordered set, so word order is not preserved. CBOW acts as a predictor where you feed in the surrounding context words and it tries to predict a single center word.
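Since order doesn't matter, the context embeddings can simply be averaged before scoring the vocabulary. A minimal sketch (class name, vocabulary size, and dimensions are all illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context):  # context: (batch, window) word ids
        # Averaging discards word order -- this is the "bag of words" part.
        h = self.embed(context).mean(dim=1)   # (batch, embed_dim)
        return self.out(h)                    # one score per vocabulary word

model = CBOW(vocab_size=1000, embed_dim=64)
logits = model(torch.randint(0, 1000, (8, 4)))  # 8 examples, 4 context words each
print(logits.shape)  # torch.Size([8, 1000])
```

In practice the full softmax over `vocab_size` would be replaced by negative sampling, but the averaging step is the essence of CBOW.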
Skip-gram is the reverse: you give it a target word and it predicts the words most likely to appear near that target.
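Structurally the model is even simpler than CBOW: a single embedding lookup followed by a scoring layer, trained so that the same target row matches each of the words in its window. A sketch, again with illustrative names and sizes:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, target):  # target: (batch,) word ids
        # Each target word is scored against the whole vocabulary;
        # training pairs it with every word in its context window.
        return self.out(self.embed(target))

model = SkipGram(vocab_size=1000, embed_dim=64)
scores = model(torch.tensor([3, 17]))
print(scores.shape)  # torch.Size([2, 1000])
```

After training, `model.embed.weight` holds the word vectors; the output layer is discarded.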
Word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don’t require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.