(Google ML Bootcamp) Notes on DLS Coursera - II
Course 2: Improving DNNs: Hyperparameter Tuning, Regularization and Optimization
I am not proud of these notes; you can use them, but sadly they are really only useful to me.
Week 1
- High bias → underfitting
- High variance → overfitting
- L1 regularization makes $w$ sparse
- Frobenius norm: the L2-style norm for matrices (there are several different matrix norms)
- Did you have a proper intuition of what regularization was?
- Run the network before you add dropout: with dropout on, the cost $J$ vs. epochs curve is no longer an accurate representation of training.
- Xavier init: $scaler = \sqrt{\frac{1}{n^{[l-1]}}}$
- He init: $scaler = \sqrt{\frac{2}{n^{[l-1]}}}$
- Which one is $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$? (That is the Glorot variant that averages fan-in and fan-out; see the sketch after this list.)
- Weight initialization is not a good hyperparameter to start playing with: fix the random init first, check that everything works, and only then tune.
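A minimal NumPy sketch of these initialization schemes; the layer sizes and the `init_layer` helper are made up for illustration, not from the course code:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_prev, n_curr, scheme="he"):
    """Initialize W, b for one layer using the scalers from the notes above."""
    if scheme == "he":            # common default for ReLU layers
        scaler = np.sqrt(2.0 / n_prev)
    elif scheme == "xavier":      # often used with tanh
        scaler = np.sqrt(1.0 / n_prev)
    else:                         # Glorot variant averaging fan-in and fan-out
        scaler = np.sqrt(2.0 / (n_prev + n_curr))
    W = rng.standard_normal((n_curr, n_prev)) * scaler
    b = np.zeros((n_curr, 1))     # bias can start at zero
    return W, b

W1, b1 = init_layer(n_prev=4, n_curr=3, scheme="he")
```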
Week 2 - Optimization

Bias correction in Exponentially Weighted Averages: $v_t = \frac{\beta\, v_{t-1} + (1-\beta)\, \theta_t}{1-\beta^t}$
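A quick sketch of this formula, assuming a 1-D NumPy array `theta` of observations (names are my own, not from the course):

```python
import numpy as np

def ewa_bias_corrected(theta, beta=0.9):
    """Return the bias-corrected exponentially weighted averages of theta."""
    v = 0.0
    out = []
    for t, theta_t in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * theta_t   # raw running average
        out.append(v / (1 - beta ** t))       # bias correction matters for small t
    return np.array(out)
```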

Bias correction for gradient descent with momentum is not really necessary: with the usual value $\beta=0.9$, the moving average warms up after roughly the first 10 iterations, so the initial bias matters little.
$v_{t,dw} = \beta\, v_{t-1,dw} + (1-\beta)\, dw_t$
$w := w - \alpha\, v_{t,dw}$
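A minimal sketch of one momentum update for a single parameter (variable names are my own):

```python
def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update for a single parameter."""
    v_dw = beta * v_dw + (1 - beta) * dw   # moving average of the gradient
    w = w - alpha * v_dw                   # step along the smoothed gradient
    return w, v_dw
```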

RMSprop:
$s_{t,dw} = \beta\, s_{t-1,dw} + (1-\beta)\, dw_t^2$ (a running mean of the squares)
$w := w - \alpha \frac{dw}{\sqrt{s_{t,dw}} + \epsilon}$ (if the variations were huge, $s_t$ makes the update more stable; the small $\epsilon$ avoids division by zero)

Adam = RMSprop + momentum
$v_{t,dw} = \beta_1\, v_{t-1,dw} + (1-\beta_1)\, dw_t$, with $\beta_1 = 0.9$
$s_{t,dw} = \beta_2\, s_{t-1,dw} + (1-\beta_2)\, dw_t^2$ (mean of the squares), with $\beta_2 = 0.999$
$v_{t,dw} := \frac{v_{t,dw}}{1-\beta_1^t}$ (bias correction, first moment)
$s_{t,dw} := \frac{s_{t,dw}}{1-\beta_2^t}$ (bias correction, second moment)
$w := w - \alpha \frac{v_{t,dw}}{\sqrt{s_{t,dw}} + \epsilon}$, with $\epsilon = 10^{-8}$
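Putting the momentum and RMSprop pieces together, here is a minimal Adam step for a single parameter; a sketch of the formulas above, not a full optimizer (the function name and argument layout are my own):

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw          # momentum (first moment)
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # RMSprop (second moment)
    v_hat = v_dw / (1 - beta1 ** t)                 # bias correction
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)  # combined update
    return w, v_dw, s_dw
```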
Week 3 - Tuning

Order of importance:
- learning rate
- betas, number of hidden units, mini-batch size
- learning rate decay
- Sample hyperparameter values at random rather than on a grid.

Coarse to fine: first sample coarsely, then zoom in on the zone of good hyperparameters and sample more densely there.

Search for hyperparameters on an appropriate scale (for example, a log scale for the learning rate, or for $1-\beta$).
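A small sketch of random sampling on a log scale; the ranges are just examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# learning rate sampled log-uniformly in [1e-4, 1e-1]
alpha = 10 ** rng.uniform(-4, -1)

# beta sampled so that 1 - beta is log-uniform in [1e-3, 1e-1]
# (i.e. beta between 0.9 and 0.999)
beta = 1 - 10 ** rng.uniform(-3, -1)
```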

Batch Normalization is applied before the activation function:
- Normalization of $Z^{[l]}$:
- $\mu = mean(z)$
- $\sigma^2 = var(z)$
- $z_{norm} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- $\tilde{z} = \gamma\, z_{norm} + \beta$
- Now we use $\tilde{z}$ instead of $z$.
- Using BN you don’t need the bias $b^{[l]}$ (the learned shift $\beta$ replaces it), so it is easiest to set it to zero.
- Backprop is similar to having an extra linear layer: $d\gamma$ and $d\beta$ play the same role as $dW^{[l]}$ and $db^{[l]}$.
- It decouples the changes between weights of different layers.
- It has a slight regularization effect.
- At test time, $\mu$ and $\sigma^2$ are obtained from the training data, as an average or an exponentially weighted moving average over mini-batches. (A minimal forward-pass sketch follows this list.)
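A minimal sketch of the batch-norm forward step at training time, assuming `z` has shape `(n_units, batch_size)` and `gamma`, `beta` have shape `(n_units, 1)`:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z over the mini-batch, then scale and shift with gamma, beta."""
    mu = np.mean(z, axis=1, keepdims=True)
    var = np.var(z, axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    z_tilde = gamma * z_norm + beta
    return z_tilde, (z_norm, mu, var)   # cache for backprop
```

At test time you would reuse the $\mu$ and $\sigma^2$ accumulated during training (e.g. a moving average) instead of the batch statistics.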

Why is softmax (with cross-entropy) maximum likelihood estimation?
Similar to the sigmoid case, you can use the following ansatz for the likelihood: $L = \prod_i p_i^{y_i}$, where $p_i$ is the predicted probability of class $i$ and $y_i$ is 0 except for the correct class $G$, where $y_G = 1$.
$\log L = \sum_i y_i \log{p_i}$, which we maximize; equivalently, the cross-entropy loss $J = -\sum_i y_i \log{p_i}$ is minimized.
What about $p_i$? That’s the softmax function, which is actually a “soft” (differentiable) approximation of argmax.
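A minimal sketch of softmax and the corresponding cross-entropy loss (the max-subtraction for numerical stability is a standard trick, not mentioned in the notes above):

```python
import numpy as np

def softmax(z):
    """Softmax over classes; z has shape (n_classes,)."""
    z = z - np.max(z)              # for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def cross_entropy(p, y):
    """J = -sum_i y_i * log(p_i), with y one-hot."""
    return -np.sum(y * np.log(p + 1e-12))
```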
Nice insight
- Underfitting: when the “avoidable bias” is big
- Overfitting: when the “variance” is big