(Google ML Bootcamp) Notes on DLS Coursera - II

Course 2: Improving DNNs: Hyperparameter Tuning, Regularization and Optimization

I am not proud of these notes; you can use them, but sadly they are mostly useful only to me.

Week 1

  • High bias - Underfitting
  • High variance - Overfitting
  • L1 makes w sparse
  • Frobenius norm: the L2-style norm for matrices (there are different norms for matrices)
  • Did you have a proper intuition of what Regularization was?
  • Run the network before you add dropout, as the cost J vs. epochs plot is not an accurate representation of the training once dropout is active
  • Xavier init: $scaler = \sqrt{\frac{1}{n^{[l-1]}}}$
  • He init: $scaler = \sqrt{\frac{2}{n^{[l-1]}}}$ (see the init sketch after this list)

  • Which one is $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$? That is the Xavier/Glorot variant that averages fan-in and fan-out.
  • Weight init is not a good hyperparameter to start playing with. Fix the random init first, check that everything works, and only then tune it.
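
A minimal NumPy sketch of the two scalings above; the helper name `init_layer` and the layer sizes are made up for illustration.

```python
import numpy as np

def init_layer(n_prev, n_curr, method="he"):
    """Illustrative helper: initialize one layer's weights with Xavier or He scaling."""
    if method == "xavier":
        scaler = np.sqrt(1.0 / n_prev)   # common default for tanh activations
    elif method == "he":
        scaler = np.sqrt(2.0 / n_prev)   # common default for ReLU activations
    else:
        raise ValueError(f"unknown method: {method}")
    W = np.random.randn(n_curr, n_prev) * scaler
    b = np.zeros((n_curr, 1))
    return W, b

W1, b1 = init_layer(n_prev=64, n_curr=32, method="he")  # toy layer sizes
```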

Week 2 - Optimization

  • Bias correction in Exponentially Weighted Averages: keep the running average $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ and report the corrected value $\hat{v}_t = \frac{v_t}{1-\beta^t}$ (sketch below).
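
A quick sketch on made-up scalar data, showing how the corrected estimate avoids the cold start of the raw running average:

```python
import numpy as np

beta = 0.9
theta = 5.0 + np.random.randn(100)    # made-up noisy observations around 5

v = 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th    # raw running average (starts near 0)
    v_corr = v / (1 - beta ** t)      # bias-corrected estimate (tracks the data from t=1)
print(v, v_corr)
```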

  • Bias correction for Gradient Descent with momentum is usually not necessary: with the usual value $\beta=0.9$ the moving average warms up after roughly the first 10 iterations, so the initial bias barely matters.

    $v_{t,dw} = v_{t-1, dw}*\beta + dw_t * (1-\beta)$

    $w:= w - \alpha v_{t,dw}$
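
A sketch of the momentum update on a toy quadratic objective (the target matrix and shapes are made up; in practice `dw` comes from backprop):

```python
import numpy as np

alpha, beta = 0.1, 0.9
target = np.ones((3, 3))        # toy problem: minimize ||w - target||^2
w = np.random.randn(3, 3)
v_dw = np.zeros_like(w)         # velocity term

for t in range(200):
    dw = 2 * (w - target)       # gradient of the toy objective
    v_dw = beta * v_dw + (1 - beta) * dw
    w = w - alpha * v_dw        # update with the smoothed gradient
```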

  • RMSprop:

    $s_{t,dw} = s_{t-1, dw}*\beta + dw_t^2 * (1-\beta)$ - mean of the squares

    $w:= w - \alpha \frac{dw}{\sqrt{s_{t,dw}}}$ - if the variations were huge, $s_t$ makes the update more stable.
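
The same toy problem with the RMSprop update; the small `eps` in the denominator is the usual guard against division by zero (the notes introduce it below, in the Adam bullet):

```python
import numpy as np

alpha, beta, eps = 0.01, 0.9, 1e-8
target = np.ones((3, 3))        # toy problem: minimize ||w - target||^2
w = np.random.randn(3, 3)
s_dw = np.zeros_like(w)         # running mean of squared gradients

for t in range(200):
    dw = 2 * (w - target)
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)   # large past gradients damp the step
```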

  • Adam = RMSprop + momentum

    $v_{t,dw} = v_{t-1, dw}*\beta_1 + dw_t * (1-\beta_1)$ - $\beta_1=0.9$

    $s_{t,dw} = s_{t-1, dw}*\beta_2 + dw_t^2 * (1-\beta_2)$ - mean of the squares - $\beta_2=0.999$

    $v_{t,dw} := \frac{v_{t,dw} }{1-\beta_1^t}$ - first moment

    $s_{t,dw} := \frac{s_{t,dw} }{1-\beta_2^t}$ - second moment

    $w:= w - \alpha \frac{v_{t,dw}}{\sqrt{s_{t,dw}}+\epsilon}$ - $\epsilon = 10^{-8}$
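
Putting the pieces together, a minimal Adam step with the bias corrections above, again on a toy quadratic (shapes and the target are made up):

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
target = np.ones((3, 3))        # toy problem: minimize ||w - target||^2
w = np.random.randn(3, 3)
v_dw = np.zeros_like(w)         # first moment  (momentum part)
s_dw = np.zeros_like(w)         # second moment (RMSprop part)

for t in range(1, 1001):        # t starts at 1 for the bias correction
    dw = 2 * (w - target)
    v_dw = beta1 * v_dw + (1 - beta1) * dw
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    v_corr = v_dw / (1 - beta1 ** t)
    s_corr = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
```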

Week 3 - Tuning

  • Order of importance:

    • learning rate
    • betas, hidden units, batch size
    • learning rate decay
  • Sample hyperparameter values at random rather than on a grid
  • Coarse to fine: zoom in on the region of good hyperparameters

  • Search for hyperparameters on an appropriate scale (e.g., a log scale for the learning rate, or for $\beta$ via $1-\beta$); sketch below
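
A sketch of random sampling on a log scale (the ranges are examples, not prescriptions): sample the exponent uniformly for the learning rate, and sample $1-\beta$ on a log scale for $\beta$.

```python
import numpy as np

n_trials = 20

# learning rate in [1e-4, 1]: sample the exponent uniformly
r = -4 * np.random.rand(n_trials)
alphas = 10 ** r

# beta in [0.9, 0.999]: sample 1 - beta on a log scale
r = -3 + 2 * np.random.rand(n_trials)   # exponent in [-3, -1]
betas = 1 - 10 ** r
```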

  • Batch Normalization is applied before the activation function:

    • Normalization of $Z^{[l]}$
    • $\mu = mean(z)$
    • $\sigma^2 = var(z)$
    • $z_{norm} = \frac{z - \mu}{\sqrt{\sigma^2 +\epsilon}}$
    • $\tilde{z} = \gamma z_{norm} + \beta $
    • Now we use $\tilde{z}$ instead of z.
    • With BN you don’t need the bias $b^{[l]}$ (the mean subtraction cancels it), so it’s easiest to set it to zero.
    • the backprop is similar to having an extra linear layer: $d\gamma$ plays the role of $dW^{[l]}$ and $d\beta$ the role of $db^{[l]}$
    • it decouples the changes between weights of different layers.
    • it has a slight regularization effect
    • At test time, $\mu$ and $\sigma$ are obtained from the training data, as an average or a moving average.
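
A minimal sketch of the BN transform at training time for one layer's pre-activations (shapes are made up, units x batch; `batch_norm_forward` is an illustrative name):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the batch axis, then scale and shift with the learnable gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)       # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)       # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta          # this replaces Z in the activation
    return Z_tilde, mu, var                  # mu/var also feed the test-time running averages

Z = np.random.randn(4, 32)                   # 4 hidden units, mini-batch of 32
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
```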
  • Why is softmax maximum likelihood estimation?

    Similar to the sigmoid case, you can make the following ansatz: $L = \prod_i p_i^{y_i}$, where $p_i$ is the predicted probability of class $i$ and $y_i$ is 0 except at the correct class $G$, where it is 1.

    $J = -\log(L) = - \sum_i y_i \log{p_i}$; minimizing $J$ maximizes the likelihood.

    What about $p_i$? That’s the softmax function, which is actually a smooth approximation of argmax; unlike argmax, it is differentiable.
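
A sketch of a numerically stable softmax and the cross-entropy loss $J$ above (class count and batch size are made up):

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis; subtracting the max avoids overflow in exp."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(p, y):
    """J = -sum_i y_i log p_i, averaged over the batch; y is one-hot."""
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=0))

z = np.random.randn(3, 5)                              # 3 classes, batch of 5
y = np.eye(3)[:, np.random.randint(0, 3, size=5)]      # random one-hot labels
loss = cross_entropy(softmax(z), y)
```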

Nice insight

  • Underfitting: when “avoidable bias” is big
  • Overfitting: when “variance” is big
 Date: July 12, 2022
 Tags:  coding ML
