Simple Explanation of Long Short-Term Memory (LSTM) in NLP
To understand LSTMs better, let's first understand a little bit about RNNs and their drawbacks.
Recurrent Neural Networks (RNNs) are a type of neural network designed to handle sequential data by retaining information from previous inputs in a hidden state. They are commonly used for tasks like time series prediction, natural language processing, and speech recognition.
In an RNN, the output at each time step depends not only on the current input but also on the hidden state from the previous time step. This structure allows the model to maintain a form of memory over sequences, making it suitable for tasks where past information is important for predicting future outputs.
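To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step; the sizes and weight names (W_xh, W_hh) are illustrative assumptions, not tied to any specific library.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch)
input_size, hidden_size = 4, 3

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One RNN time step: the new hidden state depends on the
    current input x_t and the previous hidden state h_prev."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                     # initial hidden state
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    h = rnn_step(x_t, h)                      # "memory" carried across steps
```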
Drawbacks of RNNs:
- Vanishing and Exploding Gradient Problem
- Difficulty with Long-Term Dependencies
- Sequential Processing (No Parallel Processing)
- Short-Term Memory
To overcome the above shortcomings, the LSTM came into existence in 1995.
What is an LSTM?
LSTM stands for Long Short-Term Memory. A Long Short-Term Memory network is a type of recurrent neural network (RNN) designed to overcome the limitations of standard RNNs, particularly in retaining information over long sequences. Unlike traditional RNNs, which struggle to remember information for extended periods due to the vanishing gradient problem, LSTMs have a built-in memory system that enables them to store, update, and retrieve relevant information more effectively.
Key components (a short code sketch follows this list):
- Memory Cell (Ct): The core of the LSTM, responsible for retaining information over time. It helps the model “remember” important details over long sequences.
- Forget Gate: Decides what information from the memory cell to forget or discard based on the current input and previous hidden state. It helps the model forget irrelevant or outdated information.
- Input Gate: Determines what new information to store in the memory cell. This gate updates the memory cell with new, relevant data.
- Output Gate: Controls what information from the memory cell should be used to produce the current output. It determines how much of the memory should influence the next prediction.
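To see these components in practice, here is a minimal sketch using PyTorch's nn.LSTMCell, which keeps the hidden state and the memory cell state as separate tensors; the sizes below are illustrative assumptions, and the gates are computed internally by the cell.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch)
input_size, hidden_size, batch = 8, 16, 2

cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)      # input at the current time step
h_prev = torch.zeros(batch, hidden_size)  # previous hidden state h_{t-1}
c_prev = torch.zeros(batch, hidden_size)  # previous memory cell state C_{t-1}

# The forget, input, and output gates are applied inside the cell,
# producing a new hidden state and a new memory cell state.
h_t, c_t = cell(x_t, (h_prev, c_prev))
print(h_t.shape, c_t.shape)  # torch.Size([2, 16]) torch.Size([2, 16])
```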
Let’s break down the functioning of each gate in a simple LSTM:
1. Forget Gate:
- Purpose: Decides what information to discard from the cell state.
- How it works: It looks at the previous hidden state and the current input and passes them through a sigmoid function. This generates values between 0 and 1, where 0 means “forget everything” and 1 means “keep everything.”
Example: If the current subject changes in a language model, we might want to forget information about the previous subject (like gender).
Equation: ft = σ(Wf · [ht-1, Xt] + bf)
where σ — the sigmoid function, which squashes values into the range 0 to 1,
Wf — weights applied to the previous hidden state and the current input,
ht-1 — output from the previous time step, also called the hidden state, passed in as input,
Xt — input at the current time step,
bf — bias term.
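As a rough sketch, the forget gate from the equation above can be written in a few lines of NumPy; the function name, weight shapes, and sizes here are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
    concat = np.concatenate([h_prev, x_t])  # stack previous hidden state and current input
    return sigmoid(W_f @ concat + b_f)      # each value is between 0 (forget) and 1 (keep)

# Toy shapes (assumptions): hidden size 3, input size 4
hidden_size, input_size = 3, 4
rng = np.random.default_rng(0)
f_t = forget_gate(h_prev=rng.normal(size=hidden_size),
                  x_t=rng.normal(size=input_size),
                  W_f=rng.normal(size=(hidden_size, hidden_size + input_size)),
                  b_f=np.zeros(hidden_size))
print(f_t)  # three values between 0 and 1
```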
2. Input Gate:
- Purpose: Decides what new information to store in the cell state.
- How it works: Two steps happen here:
- A sigmoid layer decides which parts of the new information to update.
- A tanh layer creates new candidate values to be added to the cell state.
Example: In a language model, if we encounter a new subject, the input gate would add the gender of the new subject to the cell state.
Equation: it = σ(Wi · [ht-1, Xt] + bi) and C̃t = tanh(Wc · [ht-1, Xt] + bc), where C̃t is the candidate cell state.
Here Wi and Wc are the weights and bi and bc are the bias values.
The other symbols are the same as in the forget gate.
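A similar NumPy sketch for the input gate, again with illustrative names and shapes, computes both the sigmoid update mask and the tanh candidate values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(h_prev, x_t, W_i, b_i, W_c, b_c):
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)       # step 1: which entries to update (0 to 1)
    c_tilde = np.tanh(W_c @ concat + b_c)   # step 2: candidate values (-1 to 1)
    return i_t, c_tilde
```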
3. Cell State Update:
- Purpose: Updates the cell state by combining the forget and input gates.
- How it works: The old cell state is multiplied element-wise by the forget gate output (to drop irrelevant information), and the input gate output times the new candidate values is added (to store new relevant information): Ct = ft * Ct-1 + it * C̃t.
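The cell state update is then a single element-wise expression; the numbers below are toy values chosen only to show the keep/forget behaviour.

```python
import numpy as np

def update_cell_state(c_prev, f_t, i_t, c_tilde):
    # C_t = f_t * C_{t-1} + i_t * C~_t  (all element-wise)
    return f_t * c_prev + i_t * c_tilde

# Toy values: keep the 1st entry, forget the 2nd, blend the 3rd
c_prev  = np.array([2.0, -1.0, 0.5])
f_t     = np.array([1.0,  0.0, 0.5])
i_t     = np.array([0.0,  1.0, 0.5])
c_tilde = np.array([0.9, -0.8, 0.2])
print(update_cell_state(c_prev, f_t, i_t, c_tilde))  # [ 2.   -0.8   0.35]
```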
4. Output Gate:
- Purpose: Determines what information will be output from the current time step.
- How it works: A sigmoid layer determines what parts of the cell state will be output, and the cell state is passed through a tanh function to scale the values between −1 and 1. The final output is a filtered version of the cell state.
Equation: ot = σ(Wo · [ht-1, Xt] + bo) and ht = ot * tanh(Ct)
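Putting the four steps together, here is a rough end-to-end sketch of one LSTM cell step in NumPy, following the equations above; all parameter names, shapes, and the toy sequence are illustrative assumptions, not a production implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell values
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update

    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state / output
    return h_t, c_t

# Toy usage with illustrative sizes
hidden_size, input_size = 3, 4
rng = np.random.default_rng(42)
params = []
for _ in range(4):  # (W, b) pairs for forget, input, candidate, output
    params += [rng.normal(size=(hidden_size, hidden_size + input_size)),
               np.zeros(hidden_size)]

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
```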
Benefits of Using LSTM:
- Handles Long-Term Dependencies: LSTMs excel at capturing long-range patterns in sequential data.
- Mitigates Vanishing Gradient Problem: LSTMs greatly reduce the vanishing gradient issue common in traditional RNNs.
- Selective Memory: LSTMs selectively keep or discard information using forget, input, and output gates.
- Effective for Sequential Data: Ideal for tasks like time series forecasting, speech recognition, etc.
- Versatility: LSTMs are used for various sequence-based tasks such as classification, regression, and text generation (a small usage sketch follows this list).
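As a small usage sketch for one such sequence task, the snippet below wires an LSTM into a toy text classifier with PyTorch; the vocabulary size, dimensions, and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_size=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, batch, hidden_size)
        return self.fc(h_n[-1])              # classify from the last hidden state

model = LSTMClassifier()
tokens = torch.randint(0, 1000, (4, 12))     # 4 toy sequences of 12 token ids
logits = model(tokens)
print(logits.shape)                          # torch.Size([4, 2])
```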
Shortcomings of LSTM:
- High Computational Cost: LSTMs are resource-intensive and slower to train due to their complex structure.
- Memory Consumption: They consume more memory, especially when handling long sequences or large datasets.
- Difficulty in Parallelization: LSTMs process data sequentially, making parallelization difficult and slowing training.
- Overfitting with Small Data: LSTMs tend to overfit on small datasets without proper regularization.
- Architecture Complexity: LSTMs are more complex and harder to tune compared to simpler recurrent models.
Conclusion:
LSTMs have done a great job of dealing with long sequences, but they take a lot of time to run because the data is still processed sequentially. To solve this problem, Transformers were introduced. Please find the link below.
I hope you liked my article on LSTMs. Please click on Clap if you liked it and want to motivate me to write more, and share it with your friends too.
Want to connect:
LinkedIn: https://www.linkedin.com/in/anjani-kumar-9b969a39/
Thanks to the colah blog for the details.