Understanding Transformers in a Simple Way
To understand transformers, let's first understand the problems with the previous models/architectures.
1. Recurrent Neural Network (RNN):
A Recurrent Neural Network (RNN) is a type of neural network designed for sequential data, such as time series or language. It maintains a hidden state that captures information from previous inputs, allowing it to learn temporal patterns. RNNs use shared weights across time steps, making them effective for tasks where the order of data matters, like speech recognition or text generation.
Issues with RNNs:
1a) They cannot remember long sentences; information from early in the sequence fades as the sentence grows.
1b) They cannot be parallelized, because each time step depends on the output of the previous one.
To address these issues, the LSTM was introduced.
2. Long Short-Term Memory (LSTM):
Long Short-Term Memory (LSTM) is a specialized type of Recurrent Neural Network (RNN) designed to better capture long-term dependencies in sequential data. It achieves this by using a memory cell and three gates (input, output, and forget gates) to control the flow of information. These gates regulate which information is kept, updated, or discarded, making LSTMs effective for tasks like time series prediction, language modeling, and speech recognition where long-range dependencies are important.
LSTMs can remember information for a little longer than RNNs, but they still have the following issues:
2a) They are more complex than traditional RNNs and need larger amounts of training data to learn effectively.
2b) They are unsuited for online learning tasks, such as prediction or classification on non-sequential data.
2c) LSTMs take a long time to train, since the computation still runs sequentially across time steps.
To solve these issues, Transformers were introduced.
3. Transformers:
Transformers are a type of neural network architecture that transforms an input sequence into an output sequence. They do this by understanding context and tracking the relationships between elements of the data.
Basically, they can learn long-term dependencies between the words in a sentence, which makes them powerful for tasks like
machine translation, text summarization, and question answering.
Transformers rely on the attention mechanism to capture this information.
They do not have any recurrence as in RNNs; because of this, they are faster to train and can be parallelized. (We will discuss attention in detail below.)
Now let's discuss the Transformer architecture in detail.
Let's break down this architecture to understand it fully.
The Transformer is composed of encoders and decoders, which is why it is also called an encoder-decoder architecture.
As per the original paper, Attention Is All You Need (https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf),
the model uses a stack of 6 encoders and 6 decoders.
Each encoder contains one multi-head attention layer, one feed-forward neural network layer, and two Add & Norm layers.
Each decoder contains two multi-head attention layers, one feed-forward neural network layer, and three Add & Norm layers.
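To make this composition concrete, here is a minimal sketch of one encoder block in PyTorch. The framework choice and hyperparameter values (d_model = 512, 8 heads, 2048 hidden units in the feed-forward network) follow the original paper, but the code itself is only an illustrative assumption, not the article's own implementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head attention + feed-forward, each followed by Add & Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention: every position attends to every other position.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)      # first Add & Norm
        # Position-wise feed-forward network, applied to each position separately.
        x = self.norm2(x + self.ffn(x))   # second Add & Norm
        return x
```

Stacking six such blocks (for example, nn.Sequential(*[EncoderLayer() for _ in range(6)])) would give the full encoder; a decoder block additionally has a masked self-attention sublayer and a third Add & Norm.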
The working steps are as follows:
- We feed all the words of the sentence into the multi-head attention layer together. Here, every word of the sentence is compared with every other word of the sentence.
- The result of the multi-head attention layer (step 1) is passed through the feed-forward neural network for each position separately, so no information is exchanged between positions in this step.
- The result of the feed-forward neural network (step 2) is passed to the next encoder, and so on. Remember, we have 6 encoders. (Each encoder has the same structure, but since there are 6 encoders, there are six feed-forward networks, each with its own weights.)
- The result of the final encoder is passed as input to the decoder's multi-head attention layer.
- The result of the decoder's multi-head attention layer is passed to its feed-forward neural network.
- The linear and softmax layers are used to predict the output of the sequence. For example, in language translation, they produce a probability for each target word, and the word with the highest probability is returned as the output (see the sketch after this list).
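The walk-through above can be summarized with PyTorch's built-in Transformer module. This is only a sketch under assumed shapes and an assumed vocabulary size; the real inputs would be embedded and positionally encoded token sequences:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000              # sizes assumed for illustration
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)     # final linear layer

src = torch.rand(1, 7, d_model)               # embedded + positionally encoded source sentence
tgt = torch.rand(1, 5, d_model)               # embedded target words generated so far
dec_out = model(src, tgt)                     # 6 encoders feed the 6 decoders
probs = torch.softmax(to_vocab(dec_out), dim=-1)   # probability of each vocabulary word
next_word = probs[0, -1].argmax()             # word with the highest probability
```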
Attention Mechanisms in Transformers:
What is Attention?
Attention is the ability of a model to focus on the important parts of a sentence, an image, or any other input.
Inside the attention layer, SCALED DOT-PRODUCT attention is used. This is done multiple times in parallel to create the effect of multi-head attention.
Query, Key, Value (QKV):
- In transformers, attention mechanisms use QKV vectors to manage information. Imagine it as a dictionary where you have a query (what you're searching for), keys (where to find the relevant information), and values (the actual information). The query helps you locate the relevant keys, and each key has an associated value that provides the needed information. Please find the details below.
Query Vector: The query vector represents the information you're looking to discover or understand. It helps in finding connections between this query and other elements.
Imagine you’re translating a sentence from English to French. The query vector focuses on a particular word you want to translate (like “apple”). The query helps figure out which other words (keys) in the sentence are important for understanding or translating the word “apple.”
Key Vector: The key vector represents each word in the input. It is used to check how closely each word (key) is related to the query word.
For example, if you have the words [“cat”, “apple”, “tree”, “juice”], the key vector helps you determine how important each of these words is when trying to understand or translate the query word, like “apple.”
Value Vector: The value vector holds the actual information you’ll use, like the translations of the words. The value vectors are weighted depending on how relevant the query and key vectors are to each other.
For example, if “apple” is closely related to its key, the value vector (its translation, “pomme” in French) will be given more importance in the final translation.
In this case, the values could be ["chat", "pomme", "arbre", "jus"], the French translations of ["cat", "apple", "tree", "juice"].
In short, the query asks for information, the keys help check relevance, and the values provide the answer based on the importance calculated.
Scaled Dot-Product Attention:
- At the core of attention is scaled dot-product attention, which measures how closely related the query is to each key-value pair. This involves calculating the dot product between the query and each key, scaling this by the square root of the key’s dimension, and then applying a softmax function to obtain attention weights. These weights determine how much each value contributes to the final output.
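As a rough illustration, scaled dot-product attention can be written in a few lines of NumPy. The shapes and values here are made up purely for demonstration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how closely each query matches each key
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                  # values combined according to their weights

Q = np.random.rand(4, 64)   # 4 query vectors of dimension 64 (illustrative sizes)
K = np.random.rand(4, 64)   # one key vector per input word
V = np.random.rand(4, 64)   # one value vector per input word
output = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)
```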
Weighted Output:
- The final output of the attention mechanism is a combination of the values weighted by the attention scores. Higher scores mean that the corresponding values are more influential, enabling the model to focus on the most relevant parts of the input sequence for each step of processing.
Inputs:
All the input that goes into the encoder is first embedded. Embedding is a technique that converts words into fixed-length vector representations (in the Transformer architecture, 512-dimensional vectors are used). After this, positional encoding is applied to tell the model the position of each word in the given sentence.
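For example, the sinusoidal positional encoding from the original paper can be sketched like this (a minimal illustration, assuming the 512-dimensional embeddings mentioned above):

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    pos = np.arange(seq_len)[:, None]      # position of each word in the sentence
    i = np.arange(d_model)[None, :]        # index of each embedding dimension
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

# The encoding is simply added to the word embeddings before the first encoder:
# embeddings = embeddings + positional_encoding(sentence_length)
```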
Multi-Head Attention:
In the multi-head attention layer, each word in the sentence is compared with every other word of the sentence. This comparison is done in several attention heads in parallel, each working on its own slice of the vectors (see the sketch below).
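A rough sketch of how the heads are formed; the head count of 8 follows the original paper, while the rest is an illustrative assumption:

```python
import numpy as np

d_model, num_heads = 512, 8
d_head = d_model // num_heads                 # 64 dimensions per head
x = np.random.rand(10, d_model)               # 10 words, each a 512-dim embedding

# In the real layer, x is first projected into Q, K and V with learned weight matrices.
W_q = np.random.rand(d_model, d_model)
Q = x @ W_q                                   # queries for every word
# Each of Q, K and V is then split into 8 heads of 64 dimensions each:
Q_heads = Q.reshape(10, num_heads, d_head).transpose(1, 0, 2)   # shape (8, 10, 64)
# Every head runs scaled dot-product attention on its own slice, and the 8
# results are concatenated back to (10, 512) and linearly projected once more.
```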
Masked Multi-head Attention:
Masked multi-head attention in transformers helps the model focus only on past and current words when predicting the next word, preventing it from seeing future words. This is crucial for generating text in the right order. Basically, each word is only compared with the words that come before it.
How It Works:
1) During training, a mask is applied to the attention scores, setting the entries for future tokens (the words that come later in the sequence) to negative infinity, so that their attention weights become zero after the softmax.
2) This ensures that when the model is generating or predicting tokens step by step, it relies only on the words it has already seen (a small sketch of such a mask follows).
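Here is a tiny, illustrative sketch of such a causal mask; the sequence length and score values are made up:

```python
import numpy as np

seq_len = 5
scores = np.random.rand(seq_len, seq_len)            # raw attention scores

# The upper triangle (future positions) is masked out before the softmax.
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                               # -inf becomes weight 0 after softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# Row i now has non-zero weights only for positions 0..i (past and current words).
```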
Add & Normalization Layer:
The Add & Norm layer adds the sublayer's output back to its input (a residual connection) and then applies layer normalization, which standardizes the data to a common scale.
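In code, one Add & Norm step is just a residual addition followed by layer normalization. A minimal PyTorch sketch with placeholder tensors:

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(512)
x = torch.rand(10, 512)              # input to a sublayer (e.g. multi-head attention)
sublayer_out = torch.rand(10, 512)   # output of that sublayer (placeholder values)

out = norm(x + sublayer_out)         # "Add" (residual connection), then "Norm"
```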
Outputs:
At the end of the decoder, we have a linear layer and a softmax layer.
- Linear Layer:
Think of a linear layer as a simple math operation. It takes some input and multiplies it by numbers (called weights), then adds another number (called a bias). This changes the shape or size of the data.
In transformers, these layers help modify or process the information at different points, like after attention calculations or before generating the final output.
- Softmax Layer:
The softmax layer takes the final output from the linear layer and converts it into probabilities. It ensures all the values add up to 1, which makes it easier to interpret as “which option is the most likely.”
For example, in a translation task, the softmax layer helps the model decide which word is the most likely next word.
Final Output: After softmax, the output is a set of probabilities. If you’re generating text, the word with the highest probability is chosen as the next word.
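A tiny illustration of this last step, reusing the French words from the earlier example; the logit values and the four-word vocabulary are made up for demonstration:

```python
import numpy as np

vocab = ["chat", "pomme", "arbre", "jus"]
logits = np.array([1.2, 4.5, 0.3, 2.1])        # output of the final linear layer

probs = np.exp(logits) / np.exp(logits).sum()  # softmax: probabilities summing to 1
next_word = vocab[int(np.argmax(probs))]       # "pomme" has the highest probability
```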
I hope you liked my article on Transformers. Please clap if you enjoyed it and want to motivate me to write more, and share it with your friends too.
Want to connect?
LinkedIn: https://www.linkedin.com/in/anjani-kumar-9b969a39/