The word embeddings produced by BERT [1], a Transformer-based [2] architecture for NLP tasks, are known to capture the context in which a word is used. We explore what this embedding space looks like by trying different combinations of sentences.


We know that a minor change in a sentence can drastically change the meaning of a word in that sentence. For example, consider the following two sentences:

a) I am standing close to the door.

b) I am standing to close the door.

In these two sentences, the word "close" has an entirely different meaning. Simple word2vec-like embeddings cannot capture this difference because they do not take the context into account. We perform two experiments to understand how BERT deals with situations like this. In the first experiment, we study how the embeddings of a particular word differ based on the context in which it is used. In the second, we study when two very different words end up having similar embeddings due to the context.

Experiment 1: Will the BERT embeddings of a given word with two different meanings differ significantly?

We consider 54 sentences that contain the word "close". In half of these sentences, "close" is used to convey that something is "nearby", as in Sentence (a). In the remaining half, "close" is used to convey "shutting" something, as in Sentence (b).

We then obtained the BERT embeddings for these sentences using the "bert-base-uncased" model from Huggingface [3]. The resulting embeddings are projected onto a 2D plane using t-SNE, as shown in the figure below. The two colours represent the two different contexts in which the word "close" is used.

t-SNE plot of the embeddings of the word 'close' in its two different contexts, "nearby" and "shut".
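Extracting the per-word contextual vector with the Hugging Face transformers library can be sketched as below. The helper `embedding_of` is our own name, and it assumes the target word survives tokenisation as a single WordPiece token; these vectors are what the t-SNE projection and the similarity analysis below are built from.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer BERT vector of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index(word)  # assumes `word` is a single WordPiece token
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[0, idx]

near = embedding_of("I am standing close to the door.", "close")
shut = embedding_of("I am standing to close the door.", "close")
sim = torch.cosine_similarity(near, shut, dim=0).item()
print(sim)
```

The 768-dimensional vectors returned here can then be stacked and fed to any 2D projection such as scikit-learn's `TSNE`.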

The cosine similarities of the embeddings of the word "close" in these 54 sentences are shown below. As seen from the matrix, the top-left and bottom-right blocks, which correspond to cosine similarities within the same context, have significantly higher values than the top-right and bottom-left blocks, which correspond to cosine similarities across contexts. Specifically, the average cosine similarity of a pair of sentences within the same context is 0.45, while it is 0.23 for two sentences that use the word in different contexts.

Cosine Similarity matrix of the embeddings of the word 'close' in two different contexts
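A similarity matrix like this can be computed directly from the stacked per-sentence word vectors. A minimal numpy sketch, using random stand-in vectors in place of the real BERT outputs (two noisy clusters mimic the two contexts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the 54 real "close" vectors: two noisy clusters of 27 each.
near = rng.normal(0.0, 0.3, size=(27, 768)) + rng.normal(size=768)
shut = rng.normal(0.0, 0.3, size=(27, 768)) + rng.normal(size=768)
vecs = np.vstack([near, shut])

# Normalise rows; the cosine-similarity matrix is then a single matmul.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T  # 54 x 54 matrix

within = np.mean([sim[:27, :27].mean(), sim[27:, 27:].mean()])
across = sim[:27, 27:].mean()
print(within, across)
```

With real BERT vectors substituted for the random ones, `within` and `across` correspond to the 0.45 and 0.23 averages reported above.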

The BERT architecture has several encoder layers, and it has been shown that the embeddings at different layers are useful for different tasks. In particular, embeddings in the initial layers are useful for low-level tasks like PoS tagging, while the final-layer embeddings are useful for more complex tasks [3, Table 7]. Here, we try to understand how the embeddings at each encoder layer distinguish the context of the word. Specifically, we compare pairs of sentences and plot how the cosine similarity varies as a function of the layer number.

For example, consider the following three sentences.

sentence_a = "He is standing to close the door"
sentence_b = "He is standing to seal the door"
sentence_c = "He is standing close to the door"

If we look at the cosine similarity for the word "close" in sentence_a and sentence_c, the graph shows that it decreases from 1 to 0.4. That is, although the input (word2vec-like) embeddings are the same and most of the surrounding words are the same, the final layers are able to distinguish the two uses.

If we look at sentence_a and sentence_b, the cosine similarity of the words "seal" and "close" is close to zero at the initial layers of BERT and increases to 0.65. That is, although the input embeddings of "seal" and "close" are very different, the BERT architecture identifies that they are similar after processing through 5 to 6 encoder layers.

Cosine similarity as function of layer number 
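The per-layer curves above can be reproduced by asking the model for all hidden states. A sketch, assuming `bert-base-uncased` (whose output is the embedding layer plus 12 encoder layers, 13 states in total); the function name `per_layer_vectors` is ours:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def per_layer_vectors(sentence: str, word: str):
    """Return one vector per layer (embedding layer + 12 encoder layers) for `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    idx = tokenizer.convert_ids_to_tokens(enc["input_ids"][0]).index(word)
    with torch.no_grad():
        out = model(**enc)
    return [h[0, idx] for h in out.hidden_states]  # 13 vectors of size 768

a = per_layer_vectors("He is standing to close the door", "close")
c = per_layer_vectors("He is standing close to the door", "close")
sims = [torch.cosine_similarity(x, y, dim=0).item() for x, y in zip(a, c)]
print(sims)  # one cosine similarity per layer; layer 0 is the input embedding
```

Plotting `sims` against the layer index gives the curve for the sentence_a/sentence_c pair; swapping in sentence_b and the word "seal" gives the other curve.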

Experiment 2: When do two different words in two different sentences have similar embeddings?

In this experiment, we considered a corpus of sentences from a public dataset called fakenews and tried to identify pairs of words that share very similar BERT embeddings. In the process, we found some pairs of words that do not look similar at first glance but have very similar meanings in the contexts in which they are used.

(i) Allegations vs idea

Consider the word "allegations" in the sentence

"fbi are trying to restore lost public confidence over allegations of favoring hillary clinton"

When we search the entire corpus for the BERT embeddings most similar (by cosine similarity) to that of the word "allegations" in the above sentence, we get the following matches:

"despite the grand accusations, no evidence or proof has been offered by the us government"
"because of weiner gate and the sexting scandal , they started investigating it"
"the white house and the hillary clinton campaign are now married to the idea that putin is hacking the us elections"

While it is straightforward to see that the words "accusations" and "scandal" are similar to "allegations", it is not obvious that the word "idea" is similar to "allegations" without looking at the corresponding sentences. The BERT architecture has captured that similarity very well.
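The corpus search itself is a straightforward nearest-neighbour lookup over the word vectors. A minimal sketch with tiny 4-d stand-in vectors (real ones are 768-d BERT outputs; the function name `most_similar` and the toy values are ours):

```python
import numpy as np

def most_similar(query: np.ndarray, corpus_vecs: np.ndarray, corpus_words, k=3):
    """Rank corpus word occurrences by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # indices of the k highest similarities
    return [(corpus_words[i], float(scores[i])) for i in order]

# Toy demonstration: three "related" vectors and one unrelated one.
words = ["accusations", "scandal", "idea", "banana"]
vecs = np.array([[1.0, 0.1, 0.0, 0.0],
                 [0.9, 0.2, 0.0, 0.0],
                 [0.8, 0.0, 0.3, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.1, 0.0])  # stand-in for "allegations"
ranked = most_similar(query, vecs, words)
print(ranked)
```

In the real experiment the corpus vectors are one BERT embedding per word occurrence, so the same word can appear multiple times with different vectors.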

(ii) 11 vs a

"in a stunning turn of events 11 days before the 2016 presidential election , the fbi announced it is reopening its investigation"

When we searched the corpus for embeddings most similar to that of "11" in the above sentence, we obtained the following matches.

'this unprecedented investigative move comes just two days after a wiki'
'in a new york review article goldman sachs was already under investigation for committing fraud at least a year before the economic crash in'


We considered BERT-based word embeddings, which are known to capture the context in which a particular word is used. We studied two questions: how a particular word ends up having different embeddings based on the context, and how two different words end up having similar embeddings based on the context.


  1. BERT original paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
  2. Original Transformer architecture: "Attention Is All You Need"
  3. API for transformers by Huggingface -