May 30, 2024

Word Embeddings: the Geometry of Words

Today, we’ll see words in orbit. No, I’m not taking you on a trip to the Milky Way. These words in orbit are called word embeddings. In one sentence, word embeddings are vectors that represent the words of a corpus (here, a corpus is a set of sentences). All words (yes, even those that are not in any dictionary) can be represented as word embeddings. The only thing you need is a large corpus. No, your cream fudge recipe is not a large enough corpus (unless it contains hundreds of thousands of words, of course).

This is all well and good, you may say, but what are word embeddings used for? They are very useful for measuring semantic relatedness (put simply, how closely the meanings of two words are related). As examples, here are some pairs of semantically related words:

Tree / Forest
Tree / Branch
Tree / Maple
Tree / Plant

The proximity of vectors in the vector space can be interpreted as semantic relatedness: the vectors of words found in similar contexts in the corpus tend to end up close to each other. Semantic relatedness between two words can then be evaluated with a measure like cosine similarity, Euclidean distance or Manhattan distance. Don’t know these measures? No problem, the characters of Kung Fu Panda prepared something for you:

Figure 1:
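
For readers who prefer code to pandas, the three measures mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the two toy vectors are invented here and do not come from the trained models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy 2-dimensional vectors, invented for illustration
tree = [3.0, 4.0]
forest = [6.0, 8.0]

cosine = cosine_similarity(tree, forest)
euclidean = euclidean_distance(tree, forest)
manhattan = manhattan_distance(tree, forest)
print(cosine, euclidean, manhattan)
```

Note that the two toy vectors point in exactly the same direction, so their cosine similarity is maximal even though their Euclidean and Manhattan distances are not zero: cosine similarity ignores vector length, the two distances do not.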

There are several models to train word embeddings. The best known are Word2Vec [1], GloVe [2] and FastText [3]. The word embeddings that will be introduced later in this article have been trained with Word2Vec.

Word2Vec comes in two flavors:

  • Continuous bag-of-words
  • Skip-gram

Figure 2:

In fact, these two architectures are two slightly different ways to train word embeddings. Let’s take the following example:

The big ____ is barking.

You probably guessed that the word "dog" completes the sentence. This example corresponds to training with the continuous bag-of-words architecture: the objective is to predict a word from its context.

When the continuous bag-of-words architecture is used, the context is treated as a single observation:

([the, big, is, barking], dog)

The model then evaluates the probability that the word "dog" appears in a context composed of the words "the", "big", "is" and "barking".
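
To make the observation above concrete, here is a minimal sketch of how such a (context, word) pair could be built from a tokenized sentence. The function name cbow_observation and the window size are assumptions made for this illustration; this is not how Word2Vec is implemented internally.

```python
def cbow_observation(tokens, target_index, window):
    """Return (context words, target word) for one position in a sentence."""
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return (left + right, tokens[target_index])

tokens = ["the", "big", "dog", "is", "barking"]

# "dog" is at index 2; a window of 2 takes two words on each side
observation = cbow_observation(tokens, 2, window=2)
print(observation)  # (['the', 'big', 'is', 'barking'], 'dog')
```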

Let’s now take a look at this example:

____ big dog is barking.
The ____ dog is barking.
The big dog ____ barking.
The big dog is ____.

This example corresponds to training with the skip-gram architecture: the objective is to predict the context from a single word (in this case, the word "dog").

Again, some more details for those interested in maths:

When the skip-gram architecture is used, each word of the context is treated as a new observation:

(dog, the), (dog, big), (dog, is), (dog, barking)

The model then evaluates the probability that "the" appears in the same context as "dog", that "big" appears in the same context as "dog", that "is" appears in the same context as "dog" and that "barking" appears in the same context as "dog".
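
The pairs listed above can be generated with a short sketch like the following. The function name skipgram_observations and the window size are assumptions made for this illustration, not part of any library.

```python
def skipgram_observations(tokens, target_index, window):
    """Return one (target word, context word) pair per word in the window."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    target = tokens[target_index]
    return [(target, tokens[i]) for i in range(start, end) if i != target_index]

tokens = ["the", "big", "dog", "is", "barking"]

# "dog" is at index 2; a window of 2 takes two words on each side
pairs = skipgram_observations(tokens, 2, window=2)
print(pairs)  # [('dog', 'the'), ('dog', 'big'), ('dog', 'is'), ('dog', 'barking')]
```

Compared with continuous bag-of-words, the same position in the sentence yields four observations instead of one.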

Now, here are some details about the training of the word embeddings that will be presented in this article.

Step 1: Choosing the corpora

Two word embedding models were trained. The first model was trained on Wikipedia articles from the "Portail du Québec" (1,869,213 words). The articles were extracted from the Wikipédia-FR corpus [4]. The second model was trained on about 4,400 articles from the "Journal de Québec" and the "Journal de Montréal" (2,200,642 words). A web scraper was used to extract the data.

Step 2: Preprocessing the data

Both corpora were segmented into sentences and the sentences were tokenized. All characters were converted to lowercase and punctuation tokens were removed.
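
The article does not say which tools were used for this step, so the following is only a rough sketch of the same pipeline using regular expressions from the Python standard library. The function name preprocess and both patterns are assumptions; a real French corpus would probably call for a proper sentence splitter and tokenizer.

```python
import re

def preprocess(text):
    """Split text into sentences, then into lowercase tokens without punctuation."""
    # Naive segmentation: cut after ".", "!" or "?" when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in sentences:
        # Keep runs of word characters; hyphens and apostrophes inside a word are kept
        tokens = re.findall(r"\w+(?:[-'’]\w+)*", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized

corpus = preprocess("The big dog is barking. Il neige à Québec!")
print(corpus)
# [['the', 'big', 'dog', 'is', 'barking'], ['il', 'neige', 'à', 'québec']]
```

The result is a list of token lists, which is the shape of input that Gensim's Word2Vec expects for its sentences argument.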

Step 3: Choosing the architecture and setting the parameters

Both word embedding models were trained with the implementation of Word2Vec in Gensim (Python library) [5]. In both cases, the skip-gram architecture was used (a purely arbitrary choice).

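For those interested, here are the parameters used. The parameter names match Gensim 3.x; in Gensim 4.0 and later, the size parameter is named vector_size.

from gensim.models import Word2Vec  # sentences: a list of token lists
model = Word2Vec(sentences, size=250, alpha=0.025, min_alpha=0.00025,
                 min_count=5, window=3, sg=1)  # sg=1 selects skip-gram
model.train(sentences, total_examples=len(sentences), epochs=80)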

Step 4: Training

Now, let your watch’s hand complete a few circles… the second hand, don’t worry.


Step 5: Converting each model into two files

The training is over, both models are saved, and it's time to have fun. With Gensim's word2vec2tensor script, each model is converted into two TSV files: a tensor file and a metadata file. These files are needed to visualize word embeddings with the TensorFlow Embedding Projector [6].
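
To give an idea of what these two files contain, here is a hand-written sketch that produces the same kind of layout: one tab-separated vector per line in the tensor file, and the matching word on the same line number in the metadata file. This is not Gensim's script; the function name write_projector_files, the file names and the toy vectors are all invented for the illustration.

```python
import os
import tempfile

def write_projector_files(vectors, tensor_path, metadata_path):
    """Write one tab-separated vector per line, and the matching word per line."""
    with open(tensor_path, "w", encoding="utf-8") as tensor_file, \
            open(metadata_path, "w", encoding="utf-8") as metadata_file:
        for word, vector in vectors.items():
            tensor_file.write("\t".join(str(value) for value in vector) + "\n")
            metadata_file.write(word + "\n")

# Toy 2-dimensional vectors, invented for illustration
toy_vectors = {"neige": [0.9, 0.1], "tempête": [0.8, 0.3]}

# Write to a temporary folder, then read the files back to show their content
with tempfile.TemporaryDirectory() as folder:
    tensor_path = os.path.join(folder, "toy_tensor.tsv")
    metadata_path = os.path.join(folder, "toy_metadata.tsv")
    write_projector_files(toy_vectors, tensor_path, metadata_path)
    with open(tensor_path, encoding="utf-8") as f:
        tensor_lines = f.read().splitlines()
    with open(metadata_path, encoding="utf-8") as f:
        metadata_lines = f.read().splitlines()

print(tensor_lines)
print(metadata_lines)
```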

Step 6: Visualization with the TensorFlow Embedding Projector

Get ready to see the most beautiful point clouds.

Here’s the one for the word embeddings trained on Wikipedia articles:

TensorFlow link

And the one for the word embeddings trained on the articles of the "Journal de Québec" and the "Journal de Montréal":

TensorFlow link

In the right column, you can search for a word and find its closest neighbors in the vector space. You can also choose to display the cosine similarity or the Euclidean distance.
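
Behind the scenes, a neighbor search of this kind boils down to ranking every other word by its similarity to the query word. Here is a minimal sketch with cosine similarity; the function names and the toy 3-dimensional vectors are invented for the illustration and have nothing to do with the values in the trained models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy vectors, invented for illustration
toy_vectors = {
    "neige": [0.9, 0.1, 0.0],
    "tempête": [0.8, 0.3, 0.1],
    "glace": [0.7, 0.0, 0.3],
    "musée": [0.0, 0.2, 0.9],
}

neighbors = nearest_neighbors("neige", toy_vectors, top_n=2)
print([word for word, _ in neighbors])  # ['tempête', 'glace']
```

Ranking by Euclidean distance instead would mean sorting in ascending order, since a smaller distance means a closer neighbor.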

Let’s now take a look at some examples (in French). Let's search for the word "neige" (snow) in both models and see the result:

Wikipedia: tempête (storm), arpents (acres), quantité (quantity)…
Newspapers: tempête (storm), chutes (falls), glace (ice)…

Let's now try with a proper noun and search for "dion":

Wikipedia: céline, jeik, incognito…
Newspapers: céline, clermont, tremblay…

That's pretty good, isn’t it? Word embeddings probably seem harmless to you. But beware, they are sometimes dangerous! Indeed, they can be easily influenced and are happy to reflect the ideas of the corpus on which they were trained. In short, objectivity and word embeddings are not always good friends. Let's see some examples.

Let's search for the word "étudiants" (students) first:

Wikipedia: élèves (pupils), inscrits (enrolled), professeurs (professors)…
Newspapers: élèves (pupils), grève (strike), jeunes (youths)…

But what is the word "grève" (strike) doing there? If you live in Quebec, you know the answer…

Let's now search for the word "religieux" (religious):

Wikipedia: musée (museum), personnel (staff), citoyens (citizens)…
Newspapers: signes (symbols), ostentatoires (ostentatious), intégrisme (fundamentalism)…

Oh dear… I’m sure you’ve understood that the word embeddings trained on the articles of the "Journal de Québec" and the "Journal de Montréal" have been strongly influenced by the highly publicized events in Quebec.

In conclusion, if you want to work with word embeddings, take the time to choose your corpus and be careful!

References mentioned:

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Paper presented at: ICLR Workshop 2013, Scottsdale, USA.

[2] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Paper presented at: 19th Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar.

[3] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.




To find out more:

More details about word embeddings:

More details about Word2Vec:

Implementation of Word2Vec in Gensim:


TensorFlow tutorial about Word2Vec:

Suggested books:

Ganegedara, T. (2018). Natural Language Processing with TensorFlow: Teach language to machines using Python's deep learning library. Birmingham, UK: Packt Publishing. (Chap. 3: Word2vec – Learning Word Embeddings).


Goyal, P., Pandey, S., & Jain, K. (2018). Deep Learning for Natural Language Processing: Creating Neural Networks with Python. New York, USA: Apress. (Chap. 2: Word Vector Representations).


Jurafsky, D., & Martin, J. H. (2018). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Third Edition Draft. (Chap. 6: Vector Semantics).
