(Google ML Bootcamp) Notes on DLS Coursera - I
Course 1: NNs and DL
Week 1: Introduction to DL.
Geoffrey Hinton showed that a ReLU unit is a good approximation to a stack of logistic units, which is part of why ReLU works well in deep networks.
Week 2: NNs Basics
- Using MSE as the loss for logistic regression makes the optimization non-convex (many local minima), which is why the cross-entropy (log) loss is used instead.
- Python Broadcasting is more powerful than expected haha.
- Cost vs. loss function: the cost is computed over the batch, the loss on a single sample.
- Random weight initialization is not the winning trick here, and it never was: for logistic regression, zero initialization works just as well.
- Vectorize even across the batch dimension? Yes: process all $m$ samples at once instead of looping over them.
- `reshape` is a constant-time operation (in NumPy it usually just returns a view, not a copy).
- Don’t use rank-1 arrays in Python, i.e. shape `(n,)`; use `(n, 1)` or `(1, n)` instead.
- Use `assert` wisely to debug, e.g. to check array shapes (see the numpy sketch after this list).
- MLE shows up even in the baby examples: logistic regression is just a form of MLE with a single layer.
- Nomenclature: the input $X$ has shape (size_input, batch_size), one column per sample.
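A minimal numpy sketch of the points above (broadcasting, rank-1 arrays, shape `assert`s, `reshape`); the variable names and sizes are just an illustration, not from the course assignments:

```python
import numpy as np

# Broadcasting: b with shape (n, 1) is added to every column of Z with shape (n, m).
W = np.random.randn(3, 5)
X = np.random.randn(5, 10)            # (size_input, batch_size)
b = np.zeros((3, 1))
Z = np.dot(W, X) + b                  # b broadcasts across the 10 columns
assert Z.shape == (3, 10)             # cheap shape sanity check while debugging

# Rank-1 arrays are surprising: the transpose of a (5,) array is still (5,).
a = np.random.randn(5)                # shape (5,)  -> avoid
assert a.T.shape == (5,)              # transpose is a no-op
col = np.random.randn(5, 1)           # shape (5, 1) -> prefer this
assert (col @ col.T).shape == (5, 5)  # proper outer product

# reshape usually just returns a view, so it is effectively constant time.
assert Z.reshape(-1, 1).shape == (30, 1)
```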
Week 3: Shallow NNs
- Hidden-layer activations have shape (hidden_units, batch_size).
- Avoid sigmoid in hidden layers; use it only at the output layer when a 0-1 output is needed.
- At the beginning there was structured data - Genesis Book
- $n^{[L]}$: number of neurons in layer $L$.
- $W^{[L]}$ has shape $[n^{[L]}, n^{[L-1]}]$ = [next_layer, prev_layer].
- $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]} $
- $A^{[L]} = g^{[L]}(Z^{[L]})$
- $A^{[L]}$ has shape $[n^{[L]}, \text{batch\_size}]$: activations stacked as columns, one column per sample (for the input, $A^{[0]} = X$ is $[\text{data\_len}, \text{batch\_size}]$).
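A minimal forward-pass sketch for a single layer, just to make the shape conventions above concrete (the sizes and the choice of $\tanh$ as $g^{[L]}$ are mine):

```python
import numpy as np

n_prev, n_curr, m = 4, 3, 8           # n^[L-1], n^[L], batch_size

A_prev = np.random.randn(n_prev, m)   # A^[L-1]: (n^[L-1], batch_size)
W = np.random.randn(n_curr, n_prev)   # W^[L]:   (n^[L], n^[L-1]) = [next_layer, prev_layer]
b = np.zeros((n_curr, 1))             # b^[L]:   (n^[L], 1), broadcast over the columns

Z = W @ A_prev + b                    # Z^[L] = W^[L] A^[L-1] + b^[L]
A = np.tanh(Z)                        # A^[L] = g^[L](Z^[L])

assert Z.shape == (n_curr, m)         # (n^[L], batch_size)
assert A.shape == (n_curr, m)         # one column per sample
```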
Week 4: Deep L-Layer Neural Network
- Early layers learn simple features, the easiest ones to extract; the information extracted from the input gets more complex as you move from the first layers into deeper ones.
- `np.zeros` needs the shape as a tuple, e.g. `np.zeros((n, 1))`.
- Once you get the dimensions right, the rest is easy; the derivation below is for a single layer.
- $Z \leftrightarrow A$ is a bijective relation, so gradients can be written in terms of either one.
- $dZ^{[l]} = \frac{\partial J}{\partial Z^{[l]}} = \frac{\partial J}{\partial A^{[l]}} * \frac{\partial A^{[l]}}{\partial Z^{[l]}}$
- $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$
- We can continue the backward pass with either $dA$ or $dZ$.
- $dA^{[l-1]}_{i,j} = \frac{\partial J}{\partial A^{[l-1]}_{i,j}} = \frac{\partial J}{\partial Z^{[l]}_{k,m}} \frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}}$ (summation over the repeated indices $k, m$ is implied)
- $dA^{[l-1]}_{i,j} = dZ^{[l]}_{k,m} \frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}}$
- $Z^{[l]}_{k,m} = W^{[l]}_{k,p} A^{[l-1]}_{p,m} + \dots$ (using $p$ as the summation index to avoid clashing with the layer index $l$)
- $\frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}} = W^{[l]}_{k,p}\,\delta_{p,i}\,\delta_{m,j}$
- $\frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}} = W^{[l]}_{k,i}\,\delta_{m,j}$
- $dA^{[l-1]}_{i,j} = dZ^{[l]}_{k,m} W^{[l]}_{k,i}\,\delta_{m,j} = dZ^{[l]}_{k,j} W^{[l]}_{k,i}$
- $dA^{[l-1]}_{i,j} = W^{[l]T}_{i,k}\, dZ^{[l]}_{k,j}$
- Finally: $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
- $dW^{[l]}_{i,j} = \frac{\partial J}{\partial W^{[l]}_{i,j}} = \frac{\partial J}{\partial Z^{[l]}_{k,m}} \frac{\partial Z^{[l]}_{k,m}}{\partial W^{[l]}_{i,j}}$
- $dW^{[l]}_{i,j} = dZ^{[l]}_{k,m} \frac{\partial Z^{[l]}_{k,m}}{\partial W^{[l]}_{i,j}}$
As we saw before, the factor multiplying the matrix in the denominator ends up transposed in the result.
- $Z^{[l]}_{k,m} = W^{[l]}_{k,p} A^{[l-1]}_{p,m} + \dots$
- $dW^{[l]}_{i,j} = dZ^{[l]}_{k,m} A^{[l-1]}_{p,m}\,\delta_{k,i}\,\delta_{p,j} = dZ^{[l]}_{i,m} A^{[l-1]}_{j,m}$
- $dW^{[l]}_{i,j} = dZ^{[l]}_{i,m} A^{[l-1]T}_{m,j}$
Finally, $dW^{[l]} = dZ^{[l]} A^{[l-1]T}$
- This is all for stochastic gradient descent (no $1/m$ mean over the batch yet).
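A quick numerical check of the two results above, $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$ and $dW^{[l]} = dZ^{[l]} A^{[l-1]T}$: the `einsum` calls spell out the index expressions from the derivation and should agree with the matrix forms (the shapes here are arbitrary; this sketch is mine, not from the course):

```python
import numpy as np

n_prev, n_curr, m = 4, 3, 8
W  = np.random.randn(n_curr, n_prev)   # W^[l]
dZ = np.random.randn(n_curr, m)        # dZ^[l]
A  = np.random.randn(n_prev, m)        # A^[l-1]

# dA^[l-1]_{i,j} = W^[l]_{k,i} dZ^[l]_{k,j}   (sum over k)
dA_index = np.einsum('ki,kj->ij', W, dZ)
assert np.allclose(dA_index, W.T @ dZ)

# dW^[l]_{i,j} = dZ^[l]_{i,m} A^[l-1]_{j,m}   (sum over m)
dW_index = np.einsum('im,jm->ij', dZ, A)
assert np.allclose(dW_index, dZ @ A.T)
```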
Wrapping up:
At the output layer:
- $dA^{[L]} \sim d\hat{Y} = \frac{\partial J}{\partial \hat{Y}}$: the gradient with respect to the output (or whatever the cost function optimizes), given the output and its target. The cost already takes a mean over the $m$ mini-batch samples.
- $dZ^{[L]} = d\hat{Y} * g'^{[L]}(Z^{[L]})$
- $dW^{[L]} = dZ^{[L]} A^{[L-1]T}/m$ (the input to the output layer, transposed): every sample contributes a weight update and we take the mean. The $1/m$ comes from $\frac{\partial J}{\partial W}$ itself: $J = \frac{1}{m}\sum_i \mathcal{L}^{(i)}$, so the mean factors out in front of the per-sample gradients (checked numerically below).
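To make the $1/m$ concrete, here is a small gradient check (my own setup, not from the course): with a sigmoid output and the binary cross-entropy cost averaged over $m$ samples, $dZ^{[L]} = A^{[L]} - Y$, and the analytical $dW^{[L]} = dZ^{[L]} A^{[L-1]T}/m$ matches a finite-difference estimate of $\partial J / \partial W$:

```python
import numpy as np

np.random.seed(0)
n_prev, m = 4, 50
A_prev = np.random.randn(n_prev, m)             # A^[L-1], input to the output layer
Y = (np.random.rand(1, m) > 0.5).astype(float)  # binary targets
W = np.random.randn(1, n_prev) * 0.01
b = np.zeros((1, 1))

def cost(W):
    A = 1.0 / (1.0 + np.exp(-(W @ A_prev + b)))  # sigmoid output, i.e. Y_hat
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

# Analytical gradient: dZ^[L] = A^[L] - Y, then dW^[L] = dZ^[L] A^[L-1]^T / m
A = 1.0 / (1.0 + np.exp(-(W @ A_prev + b)))
dW = (A - Y) @ A_prev.T / m

# Central finite-difference estimate of dJ/dW_{0,0}
eps = 1e-7
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
dW_numeric = (cost(W_plus) - cost(W_minus)) / (2 * eps)

assert np.isclose(dW[0, 0], dW_numeric, atol=1e-5)  # the 1/m is exactly the mean in J
```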
At layer $l$:
- $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$.
- $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$
- $dW^{[l]} = dZ^{[l]} A^{[l-1]T}/m$; remember that at this layer $A^{[l-1]}$ is the input and $A^{[l]}$ the output.
It’s all about the optimization method; that’s where the $m$ (the mean over the mini-batch) comes from.
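Tying the wrap-up together, a minimal sketch of one backward step at layer $l$ (the function name and the ReLU choice for $g^{[l]}$ are mine; it also includes the $db^{[l]} = \frac{1}{m}\sum_{\text{samples}} dZ^{[l]}$ term that the notes above leave out):

```python
import numpy as np

def backward_step(dA, Z, A_prev, W, m):
    """One backward step at layer l, following the formulas above.

    dA:     dA^[l]    shape (n^[l],   m)
    Z:      Z^[l]     shape (n^[l],   m), cached from the forward pass
    A_prev: A^[l-1]   shape (n^[l-1], m), cached input to this layer
    W:      W^[l]     shape (n^[l],   n^[l-1])
    """
    g_prime = (Z > 0).astype(float)              # g'^[l] for ReLU (my choice of activation)
    dZ = dA * g_prime                            # dZ^[l] = dA^[l] * g'^[l](Z^[l])
    dW = dZ @ A_prev.T / m                       # dW^[l] = dZ^[l] A^[l-1]^T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m   # db^[l], same 1/m mean over the batch
    dA_prev = W.T @ dZ                           # dA^[l-1] = W^[l]^T dZ^[l]
    return dA_prev, dW, db
```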