# RNNs

The activation $a$ at each time step always depends on its value from the previous time step, which is combined with the current input before the activation function is applied.

• $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ can be concatenated into a single longer vector, with a matching W matrix.
• The loss function is calculated for the full sequence.

### What you should remember: (RNNs)

• The recurrent neural network, or RNN, is essentially the repeated use of a single cell.
• A basic RNN reads inputs one at a time, and remembers information through the hidden layer activations (hidden states) that are passed from one time step to the next.
• The time step dimension determines how many times the RNN cell is re-used.
• Each cell takes two inputs at each time step:
• The hidden state from the previous cell
• The current time step’s input data
• Each cell has two outputs at each time step:
• A hidden state
• A prediction
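
The points above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the tanh activation, the softmax output, the array shapes, and every name (`rnn_cell_step`, `rnn_forward`, `W_aa`, `W_ax`, `W_ya`) are assumptions made for the example. It also checks that stacking $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ into one vector with a matching stacked W gives the same result as two separate matrices.

```python
import numpy as np

def softmax(z):
    # Subtract the column-wise max for numerical stability.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_step(x_t, a_prev, W, b_a, W_ya, b_y):
    """One time step. W acts on the stacked vector [a_prev; x_t]."""
    concat = np.concatenate([a_prev, x_t], axis=0)  # (n_a + n_x, m)
    a_next = np.tanh(W @ concat + b_a)              # new hidden state
    y_pred = softmax(W_ya @ a_next + b_y)           # prediction for this step
    return a_next, y_pred

def rnn_forward(x, a0, params):
    """Re-use the same cell (same weights) once per time step."""
    T_x = x.shape[2]
    a_t, preds = a0, []
    for t in range(T_x):
        a_t, y_t = rnn_cell_step(x[:, :, t], a_t, *params)
        preds.append(y_t)
    return a_t, np.stack(preds, axis=-1)

rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 3, 5, 2, 4, 6       # illustrative sizes
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
W = np.hstack([W_aa, W_ax])                 # matching stacked weight matrix
b_a = np.zeros((n_a, 1))
W_ya = rng.standard_normal((n_y, n_a))
b_y = np.zeros((n_y, 1))
x = rng.standard_normal((n_x, m, T_x))
a0 = np.zeros((n_a, m))

a_last, y = rnn_forward(x, a0, (W, b_a, W_ya, b_y))

# Stacked form and separate-matrix form agree.
stacked = W @ np.concatenate([a0, x[:, :, 0]], axis=0)
separate = W_aa @ a0 + W_ax @ x[:, :, 0]
```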

### Overview of gates and states

#### Forget gate $\mathbf{\Gamma}_{f}$

• Let’s assume you are reading words in a piece of text, and plan to use an LSTM to keep track of grammatical structures, such as whether the subject is singular (“puppy”) or plural (“puppies”).
• If the subject changes its state (from a singular word to a plural word), the memory of the previous state becomes outdated, so you’ll “forget” that outdated state.
• The “forget gate” is a tensor containing values between 0 and 1.
• If a unit in the forget gate has a value close to 0, the LSTM will “forget” the stored state in the corresponding unit of the previous cell state.
• If a unit in the forget gate has a value close to 1, the LSTM will mostly remember the corresponding value in the stored state.
##### Equation
$\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1}$
##### Explanation of the equation:
• $\mathbf{W_{f}}$ contains weights that govern the forget gate’s behavior.
• The previous time step’s hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and current time step’s input $\mathbf{x}^{\langle t \rangle}$ are concatenated together as $[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}]$ and multiplied by $\mathbf{W_{f}}$.
• A sigmoid function is used to make each of the gate tensor’s values $\mathbf{\Gamma}_f^{\langle t \rangle}$ range from 0 to 1.
• The forget gate $\mathbf{\Gamma}_f^{\langle t \rangle}$ has the same dimensions as the previous cell state $c^{\langle t-1 \rangle}$.
• This means that the two can be multiplied together, element-wise.
• Multiplying the tensors $\mathbf{\Gamma}_f^{\langle t \rangle} * \mathbf{c}^{\langle t-1 \rangle}$ is like applying a mask over the previous cell state.
• If a single value in $\mathbf{\Gamma}_f^{\langle t \rangle}$ is 0 or close to 0, then the product is close to 0.
• This keeps the information stored in the corresponding unit in $\mathbf{c}^{\langle t-1 \rangle}$ from being remembered for the next time step.
• Similarly, if one value is close to 1, the product is close to the original value in the previous cell state.
• The LSTM will keep the information from the corresponding unit of $\mathbf{c}^{\langle t-1 \rangle}$, to be used in the next time step.
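
As a rough sketch of equation (1) and the masking behavior described above, the following NumPy snippet computes a forget gate and applies it to a previous cell state. The shapes, the random weights, and the hand-set `mask` values are assumptions chosen only to make the two extremes (0 and 1) visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_a, n_x = 4, 3                              # illustrative sizes
a_prev = rng.standard_normal((n_a, 1))       # previous hidden state
x_t = rng.standard_normal((n_x, 1))          # current input
c_prev = rng.standard_normal((n_a, 1))       # previous cell state
W_f = rng.standard_normal((n_a, n_a + n_x))  # forget-gate weights
b_f = np.zeros((n_a, 1))

concat = np.concatenate([a_prev, x_t], axis=0)
gamma_f = sigmoid(W_f @ concat + b_f)        # equation (1), values in (0, 1)
kept = gamma_f * c_prev                      # element-wise mask over c_prev

# Hand-set gate values to show the extremes: 0 erases a unit, 1 keeps it.
mask = np.array([[0.0], [1.0], [0.0], [1.0]])
masked = mask * c_prev
```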

#### Candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$

• The candidate value is a tensor containing information from the current time step that may be stored in the current cell state $\mathbf{c}^{\langle t \rangle}$.
• The parts of the candidate value that get passed on depend on the update gate.
• The candidate value is a tensor containing values that range from -1 to 1.
• The tilde “~” is used to differentiate the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ from the cell state $\mathbf{c}^{\langle t \rangle}$.
##### Equation

$$\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{3}$$

##### Explanation of the equation
• The tanh function produces values between -1 and 1.
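
A small NumPy sketch of equation (3), assuming a batch of `m` examples stored column-wise and randomly initialized `W_c`; the sizes are arbitrary and only serve to show the shapes and the tanh range.

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_x, m = 4, 3, 2                        # illustrative sizes
a_prev = rng.standard_normal((n_a, m))       # previous hidden state
x_t = rng.standard_normal((n_x, m))          # current input
W_c = rng.standard_normal((n_a, n_a + n_x))  # candidate weights
b_c = np.zeros((n_a, 1))

concat = np.concatenate([a_prev, x_t], axis=0)   # (n_a + n_x, m)
c_tilde = np.tanh(W_c @ concat + b_c)            # equation (3)
```

The candidate has one row per hidden unit, so it lines up element-wise with the cell state.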

#### Update gate $\mathbf{\Gamma}_{i}$

• You use the update gate to decide what aspects of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to add to the cell state $c^{\langle t \rangle}$.
• The update gate decides what parts of a “candidate” tensor $\tilde{\mathbf{c}}^{\langle t \rangle}$ are passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
• The update gate is a tensor containing values between 0 and 1.
• When a unit in the update gate is close to 1, it allows the value of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to be passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
• When a unit in the update gate is close to 0, it prevents the corresponding value in the candidate from being passed onto the cell state.
• Notice that the subscript “i” is used and not “u”, to follow the convention used in the literature.
##### Equation
$\mathbf{\Gamma}_i^{\langle t \rangle} = \sigma(\mathbf{W}_i[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_i)\tag{2}$
##### Explanation of the equation
• As with the forget gate, the sigmoid makes each value of $\mathbf{\Gamma}_i^{\langle t \rangle}$ range between 0 and 1.
• The update gate is multiplied element-wise with the candidate, and this product ($\mathbf{\Gamma}_{i}^{\langle t \rangle} * \tilde{c}^{\langle t \rangle}$) is used in determining the cell state $\mathbf{c}^{\langle t \rangle}$.
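
To make the element-wise product concrete, here is a tiny example with hand-picked (hypothetical) gate and candidate values rather than values computed from real weights.

```python
import numpy as np

# Hypothetical values for three hidden units.
gamma_i = np.array([0.99, 0.5, 0.01])   # update gate: open, half-open, closed
c_tilde = np.array([0.8, -0.6, 0.9])    # candidate values in [-1, 1]

# What the candidate contributes to the new cell state.
contribution = gamma_i * c_tilde        # approximately [0.792, -0.3, 0.009]
```

The first unit passes almost all of its candidate value through, while the third is almost entirely blocked.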

#### Cell state $\mathbf{c}^{\langle t \rangle}$

• The cell state is the “memory” that gets passed onto future time steps.
• The new cell state $\mathbf{c}^{\langle t \rangle}$ is a combination of the previous cell state and the candidate value.
##### Equation
$\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle} \tag{4}$
##### Explanation of equation
• The new cell state is the sum of two terms: the previous cell state $\mathbf{c}^{\langle t-1 \rangle}$, adjusted (weighted) by the forget gate $\mathbf{\Gamma}_{f}^{\langle t \rangle}$,
• and the candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$, adjusted (weighted) by the update gate $\mathbf{\Gamma}_{i}^{\langle t \rangle}$.
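
Equation (4) can be worked through by hand on a few units. The numbers below are made up for illustration, and `update_cell_state` is a hypothetical helper name.

```python
import numpy as np

def update_cell_state(gamma_f, c_prev, gamma_i, c_tilde):
    # Equation (4): keep part of the old memory, add part of the candidate.
    return gamma_f * c_prev + gamma_i * c_tilde

c_prev  = np.array([2.0, -1.0,  0.5])   # previous cell state
c_tilde = np.array([0.5,  0.9, -0.8])   # candidate values
gamma_f = np.array([1.0,  0.0,  0.5])   # forget gate
gamma_i = np.array([0.0,  1.0,  0.5])   # update gate

c_next = update_cell_state(gamma_f, c_prev, gamma_i, c_tilde)
# unit 0: 1.0 *  2.0 + 0.0 *  0.5 =  2.0   (old memory kept, candidate ignored)
# unit 1: 0.0 * -1.0 + 1.0 *  0.9 =  0.9   (old memory replaced by candidate)
# unit 2: 0.5 *  0.5 + 0.5 * -0.8 = -0.15  (blend of both)
```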

#### Output gate $\mathbf{\Gamma}_{o}$

• The output gate decides what gets sent as the prediction (output) of the time step.
• The output gate is like the other gates, in that it contains values that range from 0 to 1.
##### Equation
$\mathbf{\Gamma}_o^{\langle t \rangle}= \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}$
##### Explanation of the equation
• The output gate is determined by the previous hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and the current input $\mathbf{x}^{\langle t \rangle}$.
• The sigmoid makes the gate range from 0 to 1.

#### Hidden state $\mathbf{a}^{\langle t \rangle}$

• The hidden state gets passed to the LSTM cell’s next time step.
• It is used to determine the three gates ($\mathbf{\Gamma}_{f}, \mathbf{\Gamma}_{i}, \mathbf{\Gamma}_{o}$) of the next time step.
• The hidden state is also used for the prediction $y^{\langle t \rangle}$.
##### Equation
$\mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6}$
##### Explanation of equation
• The hidden state $\mathbf{a}^{\langle t \rangle}$ is determined by the cell state $\mathbf{c}^{\langle t \rangle}$ in combination with the output gate $\mathbf{\Gamma}_{o}$.
• The cell state is passed through the tanh function to rescale values between -1 and 1.
• The output gate acts like a “mask” that either preserves the values of $\tanh(\mathbf{c}^{\langle t \rangle})$ or keeps those values from being included in the hidden state $\mathbf{a}^{\langle t \rangle}$.
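
Equations (5) and (6) can be sketched together as follows. The weights are random and the sizes are arbitrary assumptions; the point is only that the hidden state is the squashed cell state, masked by the output gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_a, n_x, m = 4, 3, 2                        # illustrative sizes
a_prev = rng.standard_normal((n_a, m))       # previous hidden state
x_t = rng.standard_normal((n_x, m))          # current input
c_t = 3.0 * rng.standard_normal((n_a, m))    # current cell state (can exceed [-1, 1])
W_o = rng.standard_normal((n_a, n_a + n_x))  # output-gate weights
b_o = np.zeros((n_a, 1))

concat = np.concatenate([a_prev, x_t], axis=0)
gamma_o = sigmoid(W_o @ concat + b_o)        # equation (5)
a_t = gamma_o * np.tanh(c_t)                 # equation (6)
```

Because the gate lies in (0, 1) and tanh lies in [-1, 1], every entry of `a_t` has magnitude below 1, even when the cell state itself is large.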

#### Prediction $\mathbf{y}^{\langle t \rangle}_{pred}$

• The prediction in this use case is a classification, so you’ll use a softmax.

The equation is: $$\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})$$
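
A minimal NumPy version of this prediction step might look like the following. The function names and shapes are assumptions, and the max-subtraction is a common numerical-stability trick rather than part of the equation.

```python
import numpy as np

def softmax(z):
    # Shift by the column-wise max so np.exp never overflows.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def predict(a_t, W_y, b_y):
    """y_pred = softmax(W_y a_t + b_y), one column per example."""
    return softmax(W_y @ a_t + b_y)

rng = np.random.default_rng(5)
n_a, n_y, m = 4, 3, 2                   # illustrative sizes
a_t = rng.standard_normal((n_a, m))     # hidden state
W_y = rng.standard_normal((n_y, n_a))
b_y = np.zeros((n_y, 1))

y_pred = predict(a_t, W_y, b_y)         # class probabilities
labels = y_pred.argmax(axis=0)          # most likely class per example
```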

#### What you should remember:

• An LSTM is similar to an RNN in that they both use hidden states to pass along information, but an LSTM also uses a cell state, which is like a long-term memory, to help deal with the issue of vanishing gradients.
• An LSTM cell consists of a cell state, or long-term memory, a hidden state, or short-term memory, along with 3 gates that constantly update the relevancy of its inputs:
• A forget gate, which decides which units of the previous cell state should be remembered and passed along. It’s a tensor with values between 0 and 1.
• If a unit has a value close to 0, the LSTM will “forget” the stored state in the previous cell state.
• If it has a value close to 1, the LSTM will mostly remember the corresponding value.
• An update gate, again a tensor containing values between 0 and 1. It decides on what information to throw away, and what new information to add.
• When a unit in the update gate is close to 1, the value of its candidate is passed on to the cell state.
• When a unit in the update gate is close to 0, it’s prevented from being passed onto the cell state.
• And an output gate, which decides what gets sent as the output of the time step.
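
Putting the pieces together, one LSTM time step following equations (1) to (6) could be sketched like this. The parameter dictionary `p`, its key names, and the zero initial states are assumptions for the example, not a fixed API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, a_prev, c_prev, p):
    """One LSTM time step; p maps names like 'W_f', 'b_f' to arrays."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])   # forget gate, eq. (1)
    gamma_i = sigmoid(p["W_i"] @ concat + p["b_i"])   # update gate, eq. (2)
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])   # candidate,   eq. (3)
    c_next = gamma_f * c_prev + gamma_i * c_tilde     # cell state,  eq. (4)
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])   # output gate, eq. (5)
    a_next = gamma_o * np.tanh(c_next)                # hidden state, eq. (6)
    return a_next, c_next

rng = np.random.default_rng(4)
n_a, n_x, m = 5, 3, 2                                 # illustrative sizes
p = {}
for g in "fico":                                      # forget, update, candidate, output
    p[f"W_{g}"] = rng.standard_normal((n_a, n_a + n_x))
    p[f"b_{g}"] = np.zeros((n_a, 1))

a, c = np.zeros((n_a, m)), np.zeros((n_a, m))         # initial states
for t in range(4):                                    # same weights at every step
    a, c = lstm_cell_step(rng.standard_normal((n_x, m)), a, c, p)
```

Both `a` (short-term memory) and `c` (long-term memory) are carried from one step to the next, but only `c` is updated additively through the forget and update gates.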

#### What you should remember:

• A sequence model can be used to generate musical values, which are then post-processed into MIDI music.
• You can use a fairly similar model for tasks ranging from generating dinosaur names to generating original music, with the only major difference being the input fed to the model.
• In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps $1, \ldots, T_x$.

#### What you should remember (This applies only to TensorFlow layers):

• If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly.
• Word embeddings allow your model to work on words in the test set that may not even appear in the training set.
• Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
• To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
• An Embedding() layer can be initialized with pretrained values.
• These values can be either fixed or trained further on your dataset.
• If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
• LSTM() has a flag called return_sequences to decide whether to return every hidden state or only the last one.
• You can use Dropout() right after LSTM() to regularize your network.
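
The padding and embedding points can be illustrated without any framework. In the sketch below, `pad_batch` is a hand-written helper (not the Keras utility), index 0 is assumed to be reserved for padding, and the random `emb_matrix` merely stands in for pretrained vectors.

```python
import numpy as np

def pad_batch(seqs, pad_value=0):
    """Right-pad lists of word indices so every row has the same length."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_value, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s
    return out

# Three sentences of different lengths, already mapped to word indices.
batch = pad_batch([[4, 7, 2], [9], [3, 5]])

# Stand-in for a pretrained embedding matrix: 10 words, 4 dimensions each.
rng = np.random.default_rng(6)
emb_matrix = rng.standard_normal((10, 4))

# An embedding lookup is just row indexing into the matrix.
embedded = emb_matrix[batch]            # shape (batch, time steps, embedding dim)
```

After padding, the whole mini-batch is one rectangular array, which is what lets a framework process all examples together.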

#### 2.2.1 Block Training for BatchNormalization Layers

If you are going to fine-tune a pretrained model, it is important that you block (freeze) the weights of all your BatchNormalization layers so that they are not updated during training. If you are going to train a new model from scratch, skip the next cell.

Date: August 24, 2022