# (Google ML Bootcamp) Notes on DLS Coursera - V

# Course V - Sequence Models

# RNNs

The hidden state $a^{\langle t \rangle}$ is always related to the previous value $a^{\langle t-1 \rangle}$: the previous hidden state is fed in before the activation function of the next time step is applied.

- $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ can be joined (concatenated) into a single longer vector, with a matching $W$ matrix (see the sketch below).
- The loss function is calculated over the full sequence.
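
A quick numpy check of this idea (sizes and variable names here are illustrative, not from the assignment): stacking $W_{aa}$ and $W_{ax}$ side by side and concatenating $a^{\langle t-1 \rangle}$ with $x^{\langle t \rangle}$ gives the same product as keeping them separate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 5, 3                               # assumed sizes: hidden units, input features

W_aa = rng.standard_normal((n_a, n_a))        # multiplies the previous hidden state
W_ax = rng.standard_normal((n_a, n_x))        # multiplies the current input
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

# Two separate products ...
separate = W_aa @ a_prev + W_ax @ x_t
# ... equal one joint matrix applied to the concatenated vector [a_prev; x_t]
W_a = np.hstack([W_aa, W_ax])
joint = W_a @ np.concatenate([a_prev, x_t], axis=0)

print(np.allclose(separate, joint))           # True
```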

### What you should remember:(RNNs)

- The recurrent neural network, or RNN, is essentially the repeated use of a single cell.
- A basic RNN reads inputs one at a time, and remembers information through the hidden layer activations (hidden states) that are passed from one time step to the next.
- The time step dimension determines how many times to re-use the RNN cell
- Each cell takes two inputs at each time step:
  - The hidden state from the previous cell
  - The current time step's input data
- Each cell has two outputs at each time step:
  - A hidden state
  - A prediction
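
A minimal numpy sketch of one forward step of a basic RNN cell. It mirrors the points above, but the function name, parameter dictionary, and sizes are my own illustration, not the assignment's exact code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, params):
    """One time step: a_t = tanh(W_aa a_prev + W_ax x_t + b_a), y_t = softmax(W_ya a_t + b_y)."""
    W_aa, W_ax, W_ya, b_a, b_y = (params[k] for k in ("W_aa", "W_ax", "W_ya", "b_a", "b_y"))
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)   # new hidden state
    y_t = softmax(W_ya @ a_t + b_y)                   # prediction for this time step
    return a_t, y_t

# Tiny smoke test with assumed sizes: n_x=3 inputs, n_a=5 hidden units, n_y=2 classes, m=4 examples
rng = np.random.default_rng(1)
n_x, n_a, n_y, m = 3, 5, 2, 4
params = {"W_aa": rng.standard_normal((n_a, n_a)), "W_ax": rng.standard_normal((n_a, n_x)),
          "W_ya": rng.standard_normal((n_y, n_a)), "b_a": np.zeros((n_a, 1)), "b_y": np.zeros((n_y, 1))}
a, y = rnn_cell_forward(rng.standard_normal((n_x, m)), np.zeros((n_a, m)), params)
print(a.shape, y.shape)   # (5, 4) (2, 4)
```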

### Overview of gates and states

#### Forget gate $\mathbf{\Gamma}_{f}$

- Let’s assume you are reading words in a piece of text, and plan to use an LSTM to keep track of grammatical structures, such as whether the subject is singular (“puppy”) or plural (“puppies”).
- If the subject changes its state (from a singular word to a plural word), the memory of the previous state becomes outdated, so you’ll “forget” that outdated state.
- The “forget gate” is a tensor containing values between 0 and 1.
- If a unit in the forget gate has a value close to 0, the LSTM will “forget” the stored state in the corresponding unit of the previous cell state.
- If a unit in the forget gate has a value close to 1, the LSTM will mostly remember the corresponding value in the stored state.

##### Equation

\[\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1}\]

##### Explanation of the equation:

- $\mathbf{W_{f}}$ contains weights that govern the forget gate’s behavior.
- The previous time step's hidden state $a^{\langle t-1 \rangle}$ and the current time step's input $x^{\langle t \rangle}$ are concatenated into $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiplied by $\mathbf{W_{f}}$.
- A sigmoid function is used to make each of the gate tensor’s values $\mathbf{\Gamma}_f^{\langle t \rangle}$ range from 0 to 1.
- The forget gate $\mathbf{\Gamma}_f^{\langle t \rangle}$ has the same dimensions as the previous cell state $c^{\langle t-1 \rangle}$.
- This means that the two can be multiplied together, element-wise.
- Multiplying the tensors $\mathbf{\Gamma}_f^{\langle t \rangle} * \mathbf{c}^{\langle t-1 \rangle}$ is like applying a mask over the previous cell state.
- If a single value in $\mathbf{\Gamma}_f^{\langle t \rangle}$ is 0 or close to 0, then the product is close to 0.
- This keeps the information stored in the corresponding unit in $\mathbf{c}^{\langle t-1 \rangle}$ from being remembered for the next time step.

- Similarly, if one value is close to 1, the product is close to the original value in the previous cell state.
- The LSTM will keep the information from the corresponding unit of $\mathbf{c}^{\langle t-1 \rangle}$, to be used in the next time step.
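
A hedged numpy sketch of equation (1): compute $\mathbf{\Gamma}_f^{\langle t \rangle}$ and apply it as a mask over the previous cell state. All shapes and names are assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_a, n_x, m = 5, 3, 4                                  # assumed sizes: hidden units, inputs, batch

a_prev = rng.standard_normal((n_a, m))                 # previous hidden state
x_t = rng.standard_normal((n_x, m))                    # current input
c_prev = rng.standard_normal((n_a, m))                 # previous cell state

W_f = rng.standard_normal((n_a, n_a + n_x))
b_f = np.zeros((n_a, 1))

concat = np.concatenate([a_prev, x_t], axis=0)         # [a_prev; x_t]
gamma_f = sigmoid(W_f @ concat + b_f)                  # equation (1); values in (0, 1)

masked = gamma_f * c_prev                              # element-wise "mask" over c_prev
print(gamma_f.shape == c_prev.shape)                   # True: same dims, so they multiply element-wise
```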

#### Candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$

- The candidate value is a tensor containing information from the current time step that **may** be stored in the current cell state $\mathbf{c}^{\langle t \rangle}$.
- The parts of the candidate value that get passed on depend on the update gate.
- The candidate value is a tensor containing values that range from -1 to 1.
- The tilde “~” is used to differentiate the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ from the cell state $\mathbf{c}^{\langle t \rangle}$.

##### Equation

\[\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{3}\]

##### Explanation of the equation

- The *tanh* function produces values between -1 and 1.
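
The same kind of sketch for equation (3) (again, names and sizes are assumptions, not the assignment's code):

```python
import numpy as np

rng = np.random.default_rng(3)
n_a, n_x, m = 5, 3, 4                                  # assumed sizes

a_prev = rng.standard_normal((n_a, m))
x_t = rng.standard_normal((n_x, m))

W_c = rng.standard_normal((n_a, n_a + n_x))
b_c = np.zeros((n_a, 1))

concat = np.concatenate([a_prev, x_t], axis=0)         # [a_prev; x_t]
c_candidate = np.tanh(W_c @ concat + b_c)              # equation (3); values in (-1, 1)
print(c_candidate.shape)                               # (5, 4)
```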

#### Update gate $\mathbf{\Gamma}_{i}$

- You use the update gate to decide what aspects of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to add to the cell state $c^{\langle t \rangle}$.
- The update gate decides what parts of a “candidate” tensor $\tilde{\mathbf{c}}^{\langle t \rangle}$ are passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
- The update gate is a tensor containing values between 0 and 1.
- When a unit in the update gate is close to 1, it allows the value of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to be passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
- When a unit in the update gate is close to 0, it prevents the corresponding value in the candidate from being passed onto the cell state.

- Notice that the subscript “i” is used and not “u”, to follow the convention used in the literature.

##### Equation

\[\mathbf{\Gamma}_i^{\langle t \rangle} = \sigma(\mathbf{W}_i[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_i)\tag{2}\]

##### Explanation of the equation

- Similar to the forget gate, here $\mathbf{\Gamma}_i^{\langle t \rangle}$, the sigmoid produces values between 0 and 1.
- The update gate is multiplied element-wise with the candidate, and this product ($\mathbf{\Gamma}_{i}^{\langle t \rangle} * \tilde{c}^{\langle t \rangle}$) is used in determining the cell state $\mathbf{c}^{\langle t \rangle}$.
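
A short numpy sketch of equation (2) together with the product $\mathbf{\Gamma}_{i}^{\langle t \rangle} * \tilde{\mathbf{c}}^{\langle t \rangle}$ (illustrative shapes and names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_a, n_x, m = 5, 3, 4                                      # assumed sizes

concat = np.concatenate([rng.standard_normal((n_a, m)),    # a_prev
                         rng.standard_normal((n_x, m))],   # x_t
                        axis=0)

W_i, b_i = rng.standard_normal((n_a, n_a + n_x)), np.zeros((n_a, 1))
W_c, b_c = rng.standard_normal((n_a, n_a + n_x)), np.zeros((n_a, 1))

gamma_i = sigmoid(W_i @ concat + b_i)                      # equation (2); update gate in (0, 1)
c_candidate = np.tanh(W_c @ concat + b_c)                  # equation (3); candidate value

update_term = gamma_i * c_candidate                        # contribution of the candidate to c_t
print(update_term.shape)                                   # (5, 4)
```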

#### Cell state $\mathbf{c}^{\langle t \rangle}$

- The cell state is the “memory” that gets passed onto future time steps.
- The new cell state $\mathbf{c}^{\langle t \rangle}$ is a combination of the previous cell state and the candidate value.

##### Equation

\[\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle} * \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} * \mathbf{\tilde{c}}^{\langle t \rangle} \tag{4}\]

##### Explanation of equation

- The new cell state is the previous cell state $\mathbf{c}^{\langle t-1 \rangle}$, adjusted (weighted) by the forget gate $\mathbf{\Gamma}_{f}^{\langle t \rangle}$, plus the candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$, adjusted (weighted) by the update gate $\mathbf{\Gamma}_{i}^{\langle t \rangle}$.
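
Equation (4) in the same hedged numpy style, with stand-in values for the gates and candidate:

```python
import numpy as np

rng = np.random.default_rng(5)
n_a, m = 5, 4                                          # assumed sizes

c_prev = rng.standard_normal((n_a, m))                 # previous cell state
gamma_f = rng.uniform(0, 1, (n_a, m))                  # stand-in forget gate values
gamma_i = rng.uniform(0, 1, (n_a, m))                  # stand-in update gate values
c_candidate = np.tanh(rng.standard_normal((n_a, m)))   # stand-in candidate value

# Equation (4): forget part of the old memory, add part of the new candidate
c_t = gamma_f * c_prev + gamma_i * c_candidate
print(c_t.shape)                                       # (5, 4)
```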

#### Output gate $\mathbf{\Gamma}_{o}$

- The output gate decides what gets sent as the prediction (output) of the time step.
- The output gate is like the other gates, in that it contains values that range from 0 to 1.

##### Equation

\[\mathbf{\Gamma}_o^{\langle t \rangle} = \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}\]

##### Explanation of the equation

- The output gate is determined by the previous hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and the current input $\mathbf{x}^{\langle t \rangle}$
- The sigmoid makes the gate range from 0 to 1.
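
Equation (5) in the same style (illustrative shapes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
n_a, n_x, m = 5, 3, 4                                      # assumed sizes

concat = np.concatenate([rng.standard_normal((n_a, m)),    # a_prev
                         rng.standard_normal((n_x, m))],   # x_t
                        axis=0)

W_o, b_o = rng.standard_normal((n_a, n_a + n_x)), np.zeros((n_a, 1))
gamma_o = sigmoid(W_o @ concat + b_o)                      # equation (5); output gate in (0, 1)
print(gamma_o.shape)                                       # (5, 4)
```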

#### Hidden state $\mathbf{a}^{\langle t \rangle}$

- The hidden state gets passed to the LSTM cell’s next time step.
- It is used to determine the three gates ($\mathbf{\Gamma}_{f}, \mathbf{\Gamma}_{i}, \mathbf{\Gamma}_{o}$) of the next time step.
- The hidden state is also used for the prediction $y^{\langle t \rangle}$.

##### Equation

\[\mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6}\]

##### Explanation of equation

- The hidden state $\mathbf{a}^{\langle t \rangle}$ is determined by the cell state $\mathbf{c}^{\langle t \rangle}$ in combination with the output gate $\mathbf{\Gamma}_{o}$.
- The cell state is passed through the `tanh` function to rescale values between -1 and 1.
- The output gate acts like a "mask" that either preserves the values of $\tanh(\mathbf{c}^{\langle t \rangle})$ or keeps those values from being included in the hidden state $\mathbf{a}^{\langle t \rangle}$.
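
Equation (6) with stand-in values for the cell state and output gate:

```python
import numpy as np

rng = np.random.default_rng(7)
n_a, m = 5, 4                                          # assumed sizes

c_t = rng.standard_normal((n_a, m))                    # stand-in cell state
gamma_o = rng.uniform(0, 1, (n_a, m))                  # stand-in output gate

# Equation (6): the output gate masks the rescaled cell state
a_t = gamma_o * np.tanh(c_t)
print(a_t.shape)                                       # (5, 4)
```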

#### Prediction $\mathbf{y}^{\langle t \rangle}_{pred}$

- The prediction in this use case is a classification, so you’ll use a softmax.

The equation is: \(\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})\)
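
Tying equations (1) through (6) and the softmax prediction together, here is a self-contained numpy sketch of one LSTM forward step. The function name, parameter dictionary, and sizes are my own illustration, not the assignment's exact API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One LSTM step implementing equations (1)-(6) plus the softmax prediction."""
    concat = np.concatenate([a_prev, x_t], axis=0)           # [a_prev; x_t]
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])          # (1) forget gate
    gamma_i = sigmoid(p["W_i"] @ concat + p["b_i"])          # (2) update gate
    c_cand = np.tanh(p["W_c"] @ concat + p["b_c"])           # (3) candidate value
    c_t = gamma_f * c_prev + gamma_i * c_cand                # (4) cell state
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])          # (5) output gate
    a_t = gamma_o * np.tanh(c_t)                             # (6) hidden state
    y_t = softmax(p["W_y"] @ a_t + p["b_y"])                 # prediction
    return a_t, c_t, y_t

# Smoke test with assumed sizes: n_x=3, n_a=5, n_y=2, batch m=4
rng = np.random.default_rng(8)
n_x, n_a, n_y, m = 3, 5, 2, 4
p = {k: rng.standard_normal((n_a, n_a + n_x)) for k in ("W_f", "W_i", "W_c", "W_o")}
p.update({k: np.zeros((n_a, 1)) for k in ("b_f", "b_i", "b_c", "b_o")})
p["W_y"], p["b_y"] = rng.standard_normal((n_y, n_a)), np.zeros((n_y, 1))
a, c, y = lstm_cell_forward(rng.standard_normal((n_x, m)), np.zeros((n_a, m)), np.zeros((n_a, m)), p)
print(a.shape, c.shape, y.shape)   # (5, 4) (5, 4) (2, 4)
```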

#### What you should remember:

- An LSTM is similar to an RNN in that they both use hidden states to pass along information, but an LSTM also uses a cell state, which is like a long-term memory, to help deal with the issue of vanishing gradients
- An LSTM cell consists of a cell state, or long-term memory, a hidden state, or short-term memory, along with 3 gates that constantly update the relevancy of its inputs:
- A **forget** gate, which decides which input units should be remembered and passed along. It's a tensor with values between 0 and 1.
  - If a unit has a value close to 0, the LSTM will "forget" the stored state in the previous cell state.
  - If it has a value close to 1, the LSTM will mostly remember the corresponding value.
- An **update** gate, again a tensor containing values between 0 and 1. It decides what information to throw away and what new information to add.
  - When a unit in the update gate is close to 1, the value of its candidate is passed on to the cell state.
  - When a unit in the update gate is close to 0, it's prevented from being passed onto the cell state.
- An **output** gate, which decides what gets sent as the output of the time step.


#### What you should remember:

- A sequence model can be used to generate musical values, which are then post-processed into midi music.
- You can use a fairly similar model for tasks ranging from generating dinosaur names to generating original music, with the only major difference being the input fed to the model.
- In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps $1, \ldots, T_x$.
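
A hedged `tf.keras` sketch of that shared-weights idea: the LSTM cell and the Dense output layer are created once and reused inside a loop over the $T_x$ time steps. The layer sizes and input shapes below are made-up placeholders, not the assignment's values.

```python
import tensorflow as tf

T_x, n_values, n_a = 30, 90, 64        # assumed: time steps, vocabulary size, LSTM units

# Layers are defined ONCE, outside the loop, so their weights are shared across all time steps
lstm_cell = tf.keras.layers.LSTM(n_a, return_state=True)
densor = tf.keras.layers.Dense(n_values, activation="softmax")
reshaper = tf.keras.layers.Reshape((1, n_values))

X = tf.keras.Input(shape=(T_x, n_values))
a0 = tf.keras.Input(shape=(n_a,), name="a0")
c0 = tf.keras.Input(shape=(n_a,), name="c0")

a, c, outputs = a0, c0, []
for t in range(T_x):
    x_t = reshaper(X[:, t, :])                       # (batch, 1, n_values) slice for step t
    _, a, c = lstm_cell(x_t, initial_state=[a, c])   # same cell reused at every step
    outputs.append(densor(a))                        # same output layer reused at every step

model = tf.keras.Model(inputs=[X, a0, c0], outputs=outputs)
model.summary()
```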

#### What you should remember (this applies only to TensorFlow layers):

- If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly.
- Word embeddings allow your model to work on words in the test set that may not even appear in the training set.
- Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
  - To use mini-batches, the sequences need to be **padded** so that all the examples in a mini-batch have the **same length**.
  - An `Embedding()` layer can be initialized with pretrained values.
    - These values can be either fixed or trained further on your dataset.
    - If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.
  - `LSTM()` has a flag called `return_sequences` to decide if you would like to return every hidden state or only the last one.
  - You can use `Dropout()` right after `LSTM()` to regularize your network.
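
A minimal `tf.keras` sketch putting these bullets together: padded integer inputs, an `Embedding()` layer loaded with a (placeholder) pretrained matrix and kept fixed, `LSTM()` layers with `return_sequences`, and `Dropout()` right after each LSTM. The vocabulary size, embedding dimension, and number of classes are assumptions, not values from the assignment.

```python
import numpy as np
import tensorflow as tf

max_len, vocab_size, emb_dim = 10, 10_000, 50         # assumed sizes
emb_matrix = np.zeros((vocab_size, emb_dim))          # stand-in for a real pretrained matrix

# Embedding layer initialized with "pretrained" values and kept fixed (trainable=False)
embedding_layer = tf.keras.layers.Embedding(vocab_size, emb_dim, trainable=False)
embedding_layer.build((None,))                        # create the weights, then overwrite them
embedding_layer.set_weights([emb_matrix])

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")     # padded sequences of word indices
x = embedding_layer(inputs)
x = tf.keras.layers.LSTM(128, return_sequences=True)(x)      # return every hidden state
x = tf.keras.layers.Dropout(0.5)(x)                          # Dropout right after LSTM
x = tf.keras.layers.LSTM(128, return_sequences=False)(x)     # return only the last hidden state
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # e.g. 5 output classes

model = tf.keras.Model(inputs, outputs)
model.summary()
```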

#### Block Training for BatchNormalization Layers

If you are going to fine-tune a pretrained model, it is important that you freeze (block training of) the weights of all your BatchNormalization layers. If you are going to train a new model from scratch, this step is not needed.
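
A hedged `tf.keras` sketch of that step (the MobileNetV2 backbone is just an example; the point is setting `trainable = False` on every `BatchNormalization` layer before fine-tuning):

```python
import tensorflow as tf

# Example pretrained backbone; any Keras model with BatchNorm layers works the same way
base_model = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False)

# Block training of every BatchNormalization layer so its statistics stay fixed during fine-tuning
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
```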

Date: August 24, 2022