From RNN to Transformers (Without Math Jargon)

Author: Sadman Kabir Soumik

Transformer-based models are a type of neural network architecture that uses self-attention mechanisms to process input data. They were introduced in the 2017 paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, and have since become a popular choice for many natural language processing tasks.

Prerequisite: Before going further, I assume that you have a basic understanding of how neural networks work.

Before Transformers

Before transformer models were developed, one of the most commonly used types of models for natural language processing tasks was the recurrent neural network (RNN).

So, What's an RNN and How Does It Work?

Let's look at a simple example.

Suppose you are reading a story and trying to understand its meaning. You read the first sentence and work out what it means. Then you read the second sentence and interpret it in light of the first. Finally, you read the third sentence and do the same. After reading the entire story, you understand its overall meaning.

RNNs work the same way. They take the input one piece at a time, and they use the previous inputs to help them understand the current one. In other words, RNNs use their memory of previous inputs to help understand the current input.
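To make this concrete, here is a minimal sketch in Python of an RNN reading a sequence one step at a time. The sizes, random weights, and tanh activation are illustrative assumptions rather than details of any particular model:

```python
import numpy as np

# A minimal RNN "reading" a sequence one step at a time.
# Sizes and weights are arbitrary, purely for illustration.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # 5 timesteps, e.g. 5 word vectors
h = np.zeros(hidden_size)                    # the memory starts empty

for x_t in sequence:
    # The new memory mixes the current input with the previous memory.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)  # the final hidden state summarizes the whole sequence
```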

How Is an RNN Different from Normal Neural Networks?

Normal neural networks are like calculators. They take some input and give you an output, but they don't remember anything from the previous calculations. RNNs are different because they have a memory that allows them to remember previous calculations and use that information to make better predictions or understand new information.

Normal neural networks take fixed-sized inputs and produce fixed-sized outputs. In contrast, RNNs can take inputs of any size and produce outputs of any size. Moreover, RNNs can handle sequential data, such as time-series data or natural language data, which is challenging for normal neural networks.
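As a quick illustration of the variable-length point, here is a hypothetical PyTorch snippet. The same recurrent layer (same weights) consumes sequences of two different lengths, which a feedforward network with a fixed input size cannot do; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

short_seq = torch.randn(1, 2, 4)   # batch of 1, 2 timesteps, 4 features each
long_seq = torch.randn(1, 50, 4)   # batch of 1, 50 timesteps, 4 features each

_, h_short = rnn(short_seq)
_, h_long = rnn(long_seq)
print(h_short.shape, h_long.shape)  # both (1, 1, 3): one summary vector each
```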

What Are the Different Types of RNN?

There are different types of RNN, like LSTM and GRU, which are designed to help solve some of the problems of the basic RNN.

The basic RNN suffers from the vanishing gradient problem: as gradients are propagated back through many timesteps, they can shrink toward zero, which makes the network difficult to train and makes it hard to learn long-term dependencies in sequential data. In practice, this means that when making predictions based on long sequences of past inputs, the model may have difficulty remembering information from earlier in the sequence.
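A toy calculation shows the effect. Backpropagation through time multiplies the gradient by roughly the same factor at every timestep; if that factor is below 1 (the 0.9 here is an arbitrary choice), the signal from early inputs all but disappears:

```python
# Toy illustration of the vanishing gradient problem.
factor = 0.9     # per-timestep shrinkage (illustrative assumption)
gradient = 1.0
for step in range(1, 101):
    gradient *= factor
    if step in (10, 50, 100):
        print(f"after {step} steps: {gradient:.2e}")
# after 10 steps: 3.49e-01
# after 50 steps: 5.15e-03
# after 100 steps: 2.66e-05
```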

LSTM and GRU are two types of RNN architectures designed to help alleviate the vanishing gradient problem by introducing mechanisms to selectively remember or forget information from previous inputs. This allows them to better capture long-term dependencies in sequential data.

How Is an LSTM Different from a Normal RNN?

LSTM (Long Short-Term Memory) is a type of RNN designed to address the vanishing gradient problem in traditional RNNs. Here are the key differences between an LSTM and a normal RNN:

Memory cells: In traditional RNNs, the hidden state at each timestep is used to pass information to the next timestep. However, in LSTMs, there is an additional memory cell that is used to pass information through time.

Gates: LSTMs have three gates that control the flow of information in and out of the memory cell, as sketched in the code just after this list:

  • Input gate: Determines which new information to store in the cell state.
  • Forget gate: Determines what information to discard from the cell state.
  • Output gate: Determines what information to output from the cell state.
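To make the three gates concrete, here is a minimal sketch of a single LSTM step following the standard gate equations. The shapes and random weights are illustrative assumptions, not any particular library's internals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W["input"] @ z + b["input"])      # input gate: what to store
    f = sigmoid(W["forget"] @ z + b["forget"])    # forget gate: what to discard
    o = sigmoid(W["output"] @ z + b["output"])    # output gate: what to reveal
    c_tilde = np.tanh(W["cell"] @ z + b["cell"])  # candidate new information
    c = f * c_prev + i * c_tilde  # memory cell: a mostly additive update
    h = o * np.tanh(c)            # hidden state passed to the next timestep
    return h, c

# Tiny demo with arbitrary sizes and random weights.
hidden, inputs = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inputs)) * 0.1
     for k in ("input", "forget", "output", "cell")}
b = {k: np.zeros(hidden) for k in ("input", "forget", "output", "cell")}
h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W, b)
```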

Gradient flow: In traditional RNNs, the gradients can become very small as they are backpropagated through time, leading to the vanishing gradient problem. LSTMs address this problem by using the gates to selectively update the memory cell, which allows gradients to flow more easily through time.

Longer-term dependencies: Because of the memory cell and gates, LSTMs are better able to capture longer-term dependencies in sequential data than traditional RNNs.

Computationally more expensive: Because of the additional memory cell and gates, LSTMs are computationally more expensive than traditional RNNs.

Main Difference Between LSTM and GRU

The main difference is that an LSTM maintains a separate memory cell alongside its hidden state and uses three gates, while a GRU merges the two into a single state and uses only two gates (a reset gate and an update gate). LSTM is more complex and has more parameters than GRU, which can make it more expressive. However, GRU is simpler and faster to compute than LSTM.
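One rough way to see the size difference is to count parameters. In PyTorch, for example (the layer sizes here are arbitrary), an LSTM carries four weight blocks per layer (three gates plus the candidate cell) while a GRU carries three (reset gate, update gate, candidate state):

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=128)
gru = nn.GRU(input_size=128, hidden_size=128)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print("LSTM parameters:", count_params(lstm))  # 132096
print("GRU parameters: ", count_params(gru))   # 99072, about 3/4 of the LSTM
```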

The Basic Idea of Transformers

The Transformer architecture was introduced by Google researchers in 2017. It was designed to address some of the limitations of the previous state-of-the-art neural network models for NLP tasks, like Recurrent Neural Networks (RNNs).

The Transformer architecture is a newer and more advanced technology that has several advantages over RNN, especially when it comes to processing long sequences of data.

As we have already discussed, an RNN works by passing information from one node to the next in a sequence, with each node processing a small piece of information and passing it on to the next. This creates a kind of "memory" that allows the network to remember what has come before and use that information to make predictions about what will come next.

However, RNNs have some limitations. One of the biggest is that they can struggle with very long sequences of data. The network has to squeeze everything it has seen into a single hidden state, so as the sequence gets longer, information from early in the sequence tends to fade away.

This is where the Transformer architecture comes in. It is designed to be much more efficient at processing long sequences of data, like entire paragraphs of text.

The Transformer architecture works by using a technique called "self-attention." This is where the network focuses on certain parts of the input sequence and gives them more weight when processing the data. Think of it like a teacher focusing on a student who is struggling with a particular concept and giving them extra attention to help them understand it better.

This self-attention mechanism allows the Transformer architecture to process long sequences of data much more efficiently than RNN. It can also process multiple parts of the sequence at the same time, which makes it even faster and more efficient.

Another advantage of the Transformer architecture is that it can be pre-trained on large amounts of data to learn the structure of language. This means that when it is used for a specific language processing task, like translation or question-answering, it already has a good understanding of how language works and can perform the task more accurately.

Architecture of the Transformer

Imagine you are telling a story to your friends in class, and you want to make sure they understand every word you say. You might use a special notebook to write down all the words you want to use and their meanings, so you can remember them and explain them to your friends.

This notebook is like the "embedding layer" in the Transformer. It helps the computer understand the meaning of each word in a sentence by assigning each word a unique number or vector.
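In code, this "notebook" is just a lookup table. Here is a minimal sketch using PyTorch's embedding layer; the toy vocabulary and the vector size are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sentence = ["the", "cat", "sat", "on", "the", "mat"]
word_ids = torch.tensor([vocab[w] for w in sentence])
vectors = embedding(word_ids)  # each word id becomes a learned vector
print(vectors.shape)           # (6, 8): one 8-dimensional vector per word
```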

Now, imagine you want to tell your story to someone who doesn't speak the same language as you. You might use a translator to help you. The translator listens to what you say, and then says the same thing in a language the other person can understand.

The translator is like the "encoder" in the Transformer. It listens to the words you say, and then converts them into a format the computer can understand. The encoder does this by breaking the sentence into smaller parts and analyzing how they relate to each other.

Finally, imagine you want to write down your story in a different language, so people who speak that language can understand it. You might use a different notebook to write down the words in the new language and their meanings, so you can remember them and explain them to others.

This notebook is like the "decoder" in the Transformer. It takes the meaning of each word from the encoder and turns it into a sentence in a new language.

So, in short, the Transformer architecture is like a language translator. It uses an "embedding layer" to understand the meaning of words, an "encoder" to translate the words into a computer-readable format, and a "decoder" to turn the computer-readable format back into a sentence in a different language.

Technically, the Transformer architecture has two main components: the encoder and the decoder. The encoder processes the input data and the decoder generates the output; a runnable sketch follows the component list below.

  1. Encoder: The encoder takes in the input sequence, such as a sentence, and processes it into a set of "hidden" representations. These hidden representations capture the meaning of the input sequence and are used by the decoder to generate the output. The encoder is made up of several layers, each of which contains two sub-layers:
    • Multi-head Attention Layer: This layer performs the self-attention mechanism that we talked about earlier. It allows the model to focus on different parts of the input sequence and give them varying degrees of importance.
    • Feedforward Neural Network Layer: This layer applies a simple neural network to the output of the multi-head attention layer. It helps to further capture the meaning of the input sequence.
  2. Decoder: The decoder takes the hidden representations generated by the encoder and uses them to generate the output sequence, such as a translated sentence. The decoder is also made up of several layers, each of which contains three sub-layers:
    • Masked Multi-head Attention Layer: This layer performs a similar function to the multi-head attention layer in the encoder, but it is "masked" so that the model can only attend to parts of the output sequence that have already been generated.
    • Multi-head Attention Layer: This layer allows the model to attend to the hidden representations generated by the encoder. It helps the decoder to generate output sequences that are consistent with the input sequence.
    • Feedforward Neural Network Layer: This layer applies a simple neural network to the output of the attention layers. It helps the decoder refine its representation of the output sequence.
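Here is the sketch promised above: the encoder-decoder wiring expressed with PyTorch's built-in Transformer layers. The sizes are arbitrary, and a real model would also add positional encodings and an output projection:

```python
import torch
import torch.nn as nn

d_model = 64  # size of each token's vector (arbitrary)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

src = torch.randn(1, 10, d_model)  # embedded input sentence, 10 tokens
tgt = torch.randn(1, 7, d_model)   # embedded output-so-far, 7 tokens

memory = encoder(src)  # hidden representations of the input sequence

# Causal mask: the decoder may only attend to already-generated positions.
tgt_len = tgt.size(1)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # (1, 7, 64): one vector per output position
```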

Together, the encoder and decoder allow the Transformer model to process long sequences of data and generate accurate outputs. They also allow the model to be pre-trained on large amounts of data, which helps to improve its performance on specific language processing tasks.

In the Transformer architecture, "self-attention" refers to the mechanism by which the model determines which parts of the input sentence are most relevant to each other.

Let me explain it in simpler terms. When we read a sentence, we usually pay more attention to certain words that help us understand its meaning. For example, in the sentence "The cat sat on the mat", we might pay more attention to the words "cat" and "mat" because they tell us the subject and the location of the sentence.

In the same way, the Transformer uses self-attention to determine which words in a sentence are most important for understanding the meaning of the sentence. It does this by computing a weighted sum of the input embeddings, where the weights are determined by how related each word is to the other words in the sentence.
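Here is a minimal sketch of that idea, scaled dot-product self-attention, in plain NumPy. The random projection matrices are purely illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how related each word is to every other word
    weights = softmax(scores)                # one attention distribution per word
    return weights @ V                       # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # 6 word embeddings, e.g. "The cat sat on the mat"
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 8): one new vector per word
```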

The mechanism is called "self" attention because it pays attention to the input sentence itself, rather than to any external information. It allows the model to focus on the most relevant parts of the sentence and ignore the less important ones, which can help improve the accuracy of the model's predictions.

In the Transformer architecture, self-attention is used in both the encoder and decoder components to determine which parts of the input sentence and output sentence are most relevant to each other. Specifically, the self-attention component is responsible for computing the attention weights and applying them to the input embeddings to obtain the final output embeddings.

There is an excellent blog post, The Illustrated Transformer by Jay Alammar, which explains the Transformer architecture in the simplest way possible.

Why Do Transformer-Based Models Work Better Than Previous Methods?

Transformer-based models have proven to be very effective for many natural language processing tasks, such as machine translation and language modeling, because they are able to capture long-term dependencies in the input data.

Previous methods, such as recurrent neural networks (RNNs), struggled to capture these long-term dependencies because they processed the input data in a sequential manner. This means that the output at each step was only dependent on the input at that step and the previous hidden state. In contrast, transformer-based models use self-attention mechanisms, which allow the model to look at the entire input sequence at once and weight the different parts of the input according to their importance. This makes it possible for the model to capture long-term dependencies and produce more accurate predictions.

Additionally, transformer-based models are highly parallelisable, which makes them much more efficient to train and run on modern hardware, such as GPUs and TPUs. This has allowed them to be applied to large-scale natural language processing tasks, such as machine translation, which require the ability to process large amounts of data quickly.

Major Differences Between RNN and Transformers

So, here are five major differences between RNN and Transformer models:

Processing of the input sequence
  • RNN: Sequential processing; the hidden state is updated based on the previous input and used to predict the next output.
  • Transformer: Parallel processing; each input is processed independently, and the entire input sequence is used to generate the output.

Handling of long sequences
  • RNN: Can struggle with long sequences due to the "vanishing gradients" problem, where gradients become very small and the network cannot learn effectively.
  • Transformer: Can handle long sequences effectively thanks to the self-attention mechanism, which lets the model attend to different parts of the sequence.

Memory of earlier inputs
  • RNN: The hidden state can retain information from earlier inputs, but may struggle to do so over long sequences.
  • Transformer: The self-attention mechanism allows the model to retain information from all parts of the input sequence.

Training time
  • RNN: Can be slow to train, especially for long sequences, due to sequential processing.
  • Transformer: Can be faster to train, especially for long sequences, due to parallel processing.

Applicability to different tasks
  • RNN: Commonly used for sequential data tasks such as natural language processing, speech recognition, and time series analysis.
  • Transformer: Primarily used for natural language processing tasks, but can also be applied to image and video processing tasks.

Various Transformer-based Models

  1. The BERT (Bidirectional Encoder Representations from Transformers) model, which was developed by Google and is a popular choice for many natural language understanding tasks.
  2. The GPT (Generative Pre-trained Transformer) model, which was developed by OpenAI and is a popular choice for many natural language generation tasks.
  3. The Transformer-XL model, which was developed by Google and is a variant of the original transformer model that is able to capture longer-term dependencies in the input data.
  4. The XLNet model, which was developed by researchers at Google and Carnegie Mellon and builds on Transformer-XL, using a permutation-based language modeling objective to capture bidirectional context in the input data.
  5. The RoBERTa (Robustly Optimized BERT) model, which was developed by Facebook and is a variant of the BERT model that is trained on a larger dataset and uses a different training objective.
  6. The ALBERT (A Lite BERT) model, which was developed by Google and is a smaller and more efficient variant of the BERT model.
  7. The T5 (Text-To-Text Transfer Transformer) model, which was developed by Google and is a multi-task model that can be fine-tuned for a wide range of natural language processing tasks.
  8. The BART (Bidirectional and Auto-Regressive Transformers) model, which was developed by Facebook and is a denoising autoencoder that is trained to reconstruct the input data from corrupted versions of it.
  9. The ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model, which was developed by researchers at Google and Stanford and is trained as a discriminator: instead of generating text, it learns to detect which input tokens have been replaced by a small generator network, which makes pre-training more sample-efficient.
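Most of these models can be tried through the Hugging Face transformers library. As a sketch, assuming the library is installed (pip install transformers) and that the pretrained weights, here BERT, download on first use:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```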


References:

[1] http://jalammar.github.io/illustrated-transformer/

[2] https://arxiv.org/abs/1706.03762

[3] https://youtu.be/kCc8FmEb1nY
