(Google ML Bootcamp) Notes on DLS Coursera - V

Course V - Sequence Models

RNNs

The hidden state $a^{\langle t \rangle}$ always carries information from the previous time step: it is combined with the current input before the activation function of the next step is applied.

  • $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ can be concatenated into a single longer vector, paired with a correspondingly stacked weight matrix $W_a = [W_{aa}, W_{ax}]$.
  • The loss function is calculated over the full sequence by summing the per-time-step losses.

RNN Types

What you should remember (RNNs):

  • The recurrent neural network, or RNN, is essentially the repeated use of a single cell.
  • A basic RNN reads inputs one at a time, and remembers information through the hidden layer activations (hidden states) that are passed from one time step to the next.
  • The time step dimension determines how many times to re-use the RNN cell (a single cell step is sketched in code after this list).
  • Each cell takes two inputs at each time step:
    • The hidden state from the previous cell
    • The current time step’s input data
  • Each cell has two outputs at each time step:
    • A hidden state
    • A prediction
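
A minimal NumPy sketch of one basic RNN cell step under these conventions ($a$: shape (n_a, m), $x$: shape (n_x, m)); the function and parameter names below are illustrative, not necessarily the assignment's exact starter code:

```python
import numpy as np

def softmax(z):
    # Column-wise softmax (classes along axis 0)
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by):
    # New hidden state from the previous hidden state and the current input
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)
    # Prediction for this time step
    yt_pred = softmax(Wya @ a_next + by)
    return a_next, yt_pred
```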

Overview of gates and states

Forget gate $\mathbf{\Gamma}_{f}$

  • Let’s assume you are reading words in a piece of text, and plan to use an LSTM to keep track of grammatical structures, such as whether the subject is singular (“puppy”) or plural (“puppies”).
  • If the subject changes its state (from a singular word to a plural word), the memory of the previous state becomes outdated, so you’ll “forget” that outdated state.
  • The “forget gate” is a tensor containing values between 0 and 1.
    • If a unit in the forget gate has a value close to 0, the LSTM will “forget” the stored state in the corresponding unit of the previous cell state.
    • If a unit in the forget gate has a value close to 1, the LSTM will mostly remember the corresponding value in the stored state.
Equation
\[\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1}\]
Explanation of the equation:
  • $\mathbf{W_{f}}$ contains weights that govern the forget gate’s behavior.
  • The previous time step’s hidden state $a^{\langle t-1 \rangle}$ and the current time step’s input $x^{\langle t \rangle}$ are concatenated into $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiplied by $\mathbf{W_{f}}$.
  • A sigmoid function is used to make each of the gate tensor’s values $\mathbf{\Gamma}_f^{\langle t \rangle}$ range from 0 to 1.
  • The forget gate $\mathbf{\Gamma}_f^{\langle t \rangle}$ has the same dimensions as the previous cell state $c^{\langle t-1 \rangle}$.
  • This means that the two can be multiplied together, element-wise.
  • Multiplying the tensors $\mathbf{\Gamma}_f^{\langle t \rangle} * \mathbf{c}^{\langle t-1 \rangle}$ is like applying a mask over the previous cell state.
  • If a single value in $\mathbf{\Gamma}_f^{\langle t \rangle}$ is 0 or close to 0, then the product is close to 0.
    • This keeps the information stored in the corresponding unit in $\mathbf{c}^{\langle t-1 \rangle}$ from being remembered for the next time step.
  • Similarly, if one value is close to 1, the product is close to the original value in the previous cell state.
    • The LSTM will keep the information from the corresponding unit of $\mathbf{c}^{\langle t-1 \rangle}$, to be used in the next time step.
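
A minimal sketch of equation (1) in NumPy, assuming a_prev has shape (n_a, m), xt has shape (n_x, m), and $\mathbf{W}_f$ has shape (n_a, n_a + n_x); names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(a_prev, xt, Wf, bf):
    # Stack the previous hidden state on top of the current input, then squash to (0, 1)
    concat = np.concatenate((a_prev, xt), axis=0)
    return sigmoid(Wf @ concat + bf)
```

Because the result has the same shape as $\mathbf{c}^{\langle t-1 \rangle}$, it can be applied directly as an element-wise mask.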

Candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$

  • The candidate value is a tensor containing information from the current time step that may be stored in the current cell state $\mathbf{c}^{\langle t \rangle}$.
  • The parts of the candidate value that get passed on depend on the update gate.
  • The candidate value is a tensor containing values that range from -1 to 1.
  • The tilde “~” is used to differentiate the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ from the cell state $\mathbf{c}^{\langle t \rangle}$.
Equation

\(\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{3}\)

Explanation of the equation
  • The tanh function produces values between -1 and 1.
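
The candidate uses the same concatenation pattern with tanh in place of the sigmoid; a short sketch with the equation's parameter names ($\mathbf{W}_c$, $\mathbf{b}_c$) and the shape assumptions above:

```python
import numpy as np

def candidate_value(a_prev, xt, Wc, bc):
    # Equation (3): tanh keeps the candidate's values in (-1, 1)
    concat = np.concatenate((a_prev, xt), axis=0)
    return np.tanh(Wc @ concat + bc)
```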

Update gate $\mathbf{\Gamma}_{i}$

  • The update gate decides which parts of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ are added to the cell state $\mathbf{c}^{\langle t \rangle}$.
  • The update gate is a tensor containing values between 0 and 1.
    • When a unit in the update gate is close to 1, it allows the corresponding value of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to be passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
    • When a unit in the update gate is close to 0, it prevents the corresponding value in the candidate from being passed onto the cell state.
  • Notice that the subscript “i” is used and not “u”, to follow the convention used in the literature.
Equation
\[\mathbf{\Gamma}_i^{\langle t \rangle} = \sigma(\mathbf{W}_i[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_i)\tag{2}\]
Explanation of the equation
  • Similar to the forget gate, here $\mathbf{\Gamma}_i^{\langle t \rangle}$, the sigmoid produces values between 0 and 1.
  • The update gate is multiplied element-wise with the candidate, and this product ($\mathbf{\Gamma}_{i}^{\langle t \rangle} * \tilde{c}^{\langle t \rangle}$) is used in determining the cell state $\mathbf{c}^{\langle t \rangle}$.
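
The update gate is computed exactly like the forget gate, just with its own parameters $\mathbf{W}_i$ and $\mathbf{b}_i$ (sketch, same shape assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_gate(a_prev, xt, Wi, bi):
    # Equation (2): values in (0, 1) that mask the candidate element-wise
    concat = np.concatenate((a_prev, xt), axis=0)
    return sigmoid(Wi @ concat + bi)
```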

Cell state $\mathbf{c}^{\langle t \rangle}$

  • The cell state is the “memory” that gets passed onto future time steps.
  • The new cell state $\mathbf{c}^{\langle t \rangle}$ is a combination of the previous cell state and the candidate value.
Equation
\[\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle} \tag{4}\]
Explanation of equation
  • The new cell state is the previous cell state $\mathbf{c}^{\langle t-1 \rangle}$, weighted by the forget gate $\mathbf{\Gamma}_{f}^{\langle t \rangle}$, plus the candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$, weighted by the update gate $\mathbf{\Gamma}_{i}^{\langle t \rangle}$.
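
Equation (4) is pure element-wise arithmetic on tensors of the same shape, with no new parameters; a one-line sketch:

```python
def cell_state(gamma_f, c_prev, gamma_i, c_tilde):
    # Equation (4): forget-gated old memory plus update-gated candidate (element-wise)
    return gamma_f * c_prev + gamma_i * c_tilde
```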

Output gate $\mathbf{\Gamma}_{o}$

  • The output gate decides what gets sent as the prediction (output) of the time step.
  • The output gate is like the other gates, in that it contains values that range from 0 to 1.
Equation
\[\mathbf{\Gamma}_o^{\langle t \rangle}= \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}\]
Explanation of the equation
  • The output gate is determined by the previous hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and the current input $\mathbf{x}^{\langle t \rangle}$
  • The sigmoid makes the gate range from 0 to 1.
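
The same sigmoid-over-concatenation pattern once more, with parameters $\mathbf{W}_o$ and $\mathbf{b}_o$ (sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(a_prev, xt, Wo, bo):
    # Equation (5): decides how much of tanh(c) reaches the hidden state
    concat = np.concatenate((a_prev, xt), axis=0)
    return sigmoid(Wo @ concat + bo)
```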

Hidden state $\mathbf{a}^{\langle t \rangle}$

  • The hidden state gets passed to the LSTM cell’s next time step.
  • It is used to determine the three gates ($\mathbf{\Gamma}_{f}, \mathbf{\Gamma}_{i}, \mathbf{\Gamma}_{o}$) of the next time step.
  • The hidden state is also used for the prediction $y^{\langle t \rangle}$.
Equation
\[\mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6}\]
Explanation of equation
  • The hidden state $\mathbf{a}^{\langle t \rangle}$ is determined by the cell state $\mathbf{c}^{\langle t \rangle}$ in combination with the output gate $\mathbf{\Gamma}_{o}$.
  • The cell state is passed through the tanh function to rescale its values to the range (-1, 1).
  • The output gate acts like a “mask” that either preserves the values of $\tanh(\mathbf{c}^{\langle t \rangle})$ or keeps those values from being included in the hidden state $\mathbf{a}^{\langle t \rangle}$.
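
Putting equations (1) through (6) together, a sketch of one full LSTM cell step in NumPy; the parameters are passed in a dict whose key names (Wf, bf, ...) are an assumption, not necessarily the assignment's exact interface:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, p):
    # p holds the parameters Wf, bf, Wi, bi, Wc, bc, Wo, bo (key names assumed)
    concat = np.concatenate((a_prev, xt), axis=0)
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])   # (1) forget gate
    gamma_i = sigmoid(p["Wi"] @ concat + p["bi"])   # (2) update gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])   # (3) candidate value
    c_next = gamma_f * c_prev + gamma_i * c_tilde   # (4) cell state
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])   # (5) output gate
    a_next = gamma_o * np.tanh(c_next)              # (6) hidden state
    return a_next, c_next
```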

Prediction $\mathbf{y}^{\langle t \rangle}_{pred}$

  • The prediction in this use case is a classification, so you’ll use a softmax.

The equation is: \(\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})\)
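
A sketch of this prediction step, assuming $\mathbf{W}_y$ has shape (n_y, n_a) and the softmax is taken over the class dimension:

```python
import numpy as np

def predict(a_next, Wy, by):
    # Softmax over classes (axis 0) of the linear projection of the hidden state
    z = Wy @ a_next + by
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```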

What you should remember:

  • An LSTM is similar to an RNN in that they both use hidden states to pass along information, but an LSTM also uses a cell state, which is like a long-term memory, to help deal with the issue of vanishing gradients.
  • An LSTM cell consists of a cell state, or long-term memory, a hidden state, or short-term memory, along with 3 gates that constantly update the relevancy of its inputs:
    • A forget gate, which decides which input units should be remembered and passed along. It’s a tensor with values between 0 and 1.
      • If a unit has a value close to 0, the LSTM will “forget” the stored state in the previous cell state.
      • If it has a value close to 1, the LSTM will mostly remember the corresponding value.
    • An update gate, again a tensor containing values between 0 and 1. It decides what new information from the candidate to add to the cell state.
      • When a unit in the update gate is close to 1, the corresponding value of the candidate is passed on to the cell state.
      • When a unit in the update gate is close to 0, that value is prevented from being passed onto the cell state.
    • And an output gate, which decides what gets sent as the output of the time step

What you should remember:

  • A sequence model can be used to generate musical values, which are then post-processed into MIDI music.
  • You can use a fairly similar model for tasks ranging from generating dinosaur names to generating original music, with the only major difference being the input fed to the model.
  • In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps $1, \ldots, T_x$ (see the sketch below).
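
A rough sketch of that shared-weights pattern in tf.keras: each layer is created once and reused inside a loop over time steps, so the same parameters are applied at every step. The sizes (n_values, n_a, Tx) are placeholders, not the assignment's actual values:

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Reshape
from tensorflow.keras.models import Model

n_values, n_a, Tx = 90, 64, 30          # placeholder dimensions

# Define each layer once so its weights are shared across all time steps
reshaper = Reshape((1, n_values))
LSTM_cell = LSTM(n_a, return_state=True)
densor = Dense(n_values, activation="softmax")

X = Input(shape=(Tx, n_values))
a0 = Input(shape=(n_a,))
c0 = Input(shape=(n_a,))
a, c, outputs = a0, c0, []

for t in range(Tx):
    x = reshaper(X[:, t, :])                        # slice out time step t
    a, _, c = LSTM_cell(x, initial_state=[a, c])    # same LSTM weights every step
    outputs.append(densor(a))                       # same Dense weights every step

model = Model(inputs=[X, a0, c0], outputs=outputs)
```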

What you should remember (this applies only to TensorFlow layers):

  • If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly.
  • Word embeddings allow your model to work on words in the test set that may not even appear in the training set.
  • Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
    • To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
    • An Embedding() layer can be initialized with pretrained values.
      • These values can be either fixed or trained further on your dataset.
      • If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
    • LSTM() has a flag called return_sequences to decide whether to return every hidden state or only the last one.
    • You can use Dropout() right after LSTM() to regularize your network.
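
A sketch tying these points together in tf.keras: padded index inputs, an Embedding layer initialized from pretrained vectors and kept fixed, stacked LSTM layers using return_sequences, and Dropout after each LSTM. The vocabulary size, embedding matrix, sequence length, and class count below are placeholders:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense
from tensorflow.keras.models import Model

vocab_size, emb_dim, max_len, n_classes = 10000, 50, 20, 5   # placeholders
emb_matrix = np.random.rand(vocab_size, emb_dim)             # stand-in for pretrained vectors (e.g. GloVe)

inputs = Input(shape=(max_len,), dtype="int32")              # padded sequences of word indices
x = Embedding(vocab_size, emb_dim, weights=[emb_matrix],
              trainable=False)(inputs)                       # fixed pretrained embeddings
x = LSTM(128, return_sequences=True)(x)                      # return every hidden state
x = Dropout(0.5)(x)
x = LSTM(128, return_sequences=False)(x)                     # only the last hidden state
x = Dropout(0.5)(x)
outputs = Dense(n_classes, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```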

2.2.1 Block Training for BatchNormalization Layers

If you are going to fine-tune a pretrained model, it is important that you freeze (“block”) the weights of all your BatchNormalization layers. If you are going to train a new model from scratch, skip the next cell.
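
A minimal sketch of what freezing the BatchNormalization layers can look like in Keras; base_model here is just an example pretrained network, not necessarily the one used in the course:

```python
import tensorflow as tf

# Example pretrained model (any Keras model containing BatchNormalization layers works)
base_model = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False)

# Freeze ("block") every BatchNormalization layer before fine-tuning
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False   # keep the layer's statistics and scale/shift fixed
```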

 Date: August 24, 2022
 Tags:  coding ML
