Understanding LSTM

The Cell state:
An RNN finds it difficult to carry information from the early steps of a long input through to the final state, so an LSTM keeps a separate state, called the cell state, where previously learned information remains available to the model.
How this cell state is maintained, and how the model learns from new inputs, is discussed below.
Selectively Removing Information from the Cell State:

At time step t, we have the previous cell state vector (or matrix) c_{t-1}, which encodes features from all inputs before it.
We have the previous hidden state vector h_{t-1}, which encodes the influence of the last input on the long-term cell state.
We have a new input vector x_t, which should make the necessary changes to the encoding done so far.
Both these vectors are transformed into the same vector space using two learnable matrices (or dense layers), W and U, and summed. We call this combined vector s_{t-1} = W·h_{t-1} + U·x_t.
The cell state carries long-term information. We want to selectively remove some of that information, using the new information available from the new input.
A sigmoid operation (sigmoid activation layer) is applied to the combined vector s_{t-1}, which squashes each element into the range 0–1.
Elements close to 0 can be interpreted as unimportant and elements close to 1 as important. The sigmoid function is differentiable, so the whole mechanism can be learned through backpropagation.
So the combined vector produced by the two learnable weight matrices is passed through the sigmoid to get a per-element weighting. After element-wise multiplication with the cell state vector, elements weighted close to 0 are almost lost (forgotten) from the cell state, while elements weighted close to 1 are retained close to their previous values.
So we have used information from the new input and the previous hidden state to forget some information from the cell state. What is forgotten is long-term information carried so far.
We call this mechanism the forget gate. This gate controls the flow of gradients used to remove irrelevant information from the cell state.
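Below is a minimal NumPy sketch of this forget-gate computation. The sizes, the random parameter values, and the names (W_f, U_f, h_prev, and so on) are illustrative assumptions for the sketch, not taken from the article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3                       # illustrative sizes
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size))    # transforms h_{t-1}
U_f = rng.normal(size=(hidden_size, input_size))     # transforms x_t

h_prev = rng.normal(size=hidden_size)                # previous hidden state h_{t-1}
x_t    = rng.normal(size=input_size)                 # new input x_t
c_prev = rng.normal(size=hidden_size)                # previous cell state c_{t-1}

f_t = sigmoid(W_f @ h_prev + U_f @ x_t)              # forget gate, each element in (0, 1)
c_forgotten = f_t * c_prev                           # elements weighted near 0 are forgotten,
                                                     # elements weighted near 1 are retained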
Selectively Writing Information to the Cell State:

In the forget gate, information was removed from the cell state vector when multiplied by a number close to 0, while elements multiplied by a number close to 1 were left largely unchanged.
So we succeeded in eliminating information, but not yet in adding new information from the new input.
So the combined vector is passed through a tanh function (tanh activation layer), so that we can add information in the range of -1 to 1 without drastically changing the cell state values. tanh is also differentiable, which keeps the mechanism learnable.
The tanh output is weighted by a sigmoid, as in the forget gate but with different learnable matrices (W, U), so that the model learns to write/add only relevant information to the long-term cell state.
All of the above mechanism is called the input gate. This gate controls the flow of gradients used to write values into the cell state.
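Continuing the sketch above (same variables and illustrative sizes), the input gate weights a tanh candidate and adds it to the partially forgotten cell state. W_i, U_i, W_c, U_c are assumed here to be separate learnable matrices, distinct from the forget-gate ones.

W_i = rng.normal(size=(hidden_size, hidden_size))
U_i = rng.normal(size=(hidden_size, input_size))
W_c = rng.normal(size=(hidden_size, hidden_size))
U_c = rng.normal(size=(hidden_size, input_size))

i_t     = sigmoid(W_i @ h_prev + U_i @ x_t)          # input gate, each element in (0, 1)
c_tilde = np.tanh(W_c @ h_prev + U_c @ x_t)          # candidate values, each element in (-1, 1)

c_t = c_forgotten + i_t * c_tilde                    # updated cell state: forget, then write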
Getting The Hidden State:

In an RNN we use the previous hidden state and the new input to get the updated hidden state; the previous hidden state is supposed to carry all previously encoded information.
In an LSTM we have so far updated our cell state, so we use this cell state to get the hidden state.
Because the cell state carries accumulated information from all previous inputs, we first eliminate some of it using a sigmoid, just as in the forget gate (with yet another pair of learnable matrices W, U). We then multiply this sigmoid output element-wise with tanh of the cell state, giving a hidden state, influenced by the current input, whose elements lie in the range of -1 to 1.
This mechanism is called the output gate, and it controls the flow of gradients used to produce the hidden state.
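Finishing the step in the same sketch, the output gate decides how much of tanh(c_t) becomes the new hidden state; W_o, U_o are yet another pair of learnable matrices, again with illustrative random values.

W_o = rng.normal(size=(hidden_size, hidden_size))
U_o = rng.normal(size=(hidden_size, input_size))

o_t = sigmoid(W_o @ h_prev + U_o @ x_t)              # output gate, each element in (0, 1)
h_t = o_t * np.tanh(c_t)                             # hidden state, each element in (-1, 1)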
Summary
With all the above mechanisms, only selected elements are changed at each step, whereas in an RNN the whole state is transformed. So we can use the final cell state as an encoding of the whole input and the individual hidden states as encodings of the individual inputs.
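Putting the three gates together, a single step can be wrapped into one function and run over a toy sequence (continuing the variables from the sketches above). The final cell state then serves as an encoding of the whole sequence, and each hidden state as an encoding of the input at that step.

def lstm_step(x_t, h_prev, c_prev, params):
    W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o = params
    f_t = sigmoid(W_f @ h_prev + U_f @ x_t)          # forget gate
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t)          # input gate
    c_tilde = np.tanh(W_c @ h_prev + U_c @ x_t)      # candidate values
    c_t = f_t * c_prev + i_t * c_tilde               # update cell state
    o_t = sigmoid(W_o @ h_prev + U_o @ x_t)          # output gate
    h_t = o_t * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

params = (W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
hidden_states = []
for x in rng.normal(size=(5, input_size)):           # a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
    hidden_states.append(h)
# c now encodes the whole sequence; hidden_states[k] encodes input k in context.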
Additional Resources:
This article can also be found on my Medium.