# (Google ML Bootcamp) Notes on DLS Coursera - I

## Course 1: NNs and DL

### Week 1: Introduction to DL.

Geoffrey Hinton showed that a ReLU is a good approximation of a stack of logistic (sigmoid) units with different biases. That’s one reason ReLU works fine in DNNs.

### Week 2: NNs Basics

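• Using MSE as the loss for logistic regression gives a non-convex cost with several local minima; that is why the cross-entropy (log) loss is used instead.
• NumPy broadcasting is more powerful than expected haha.
• Cost/Loss function: the cost is on the batch, the loss is on a single sample.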
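• Random weight init is not the meta for logistic regression, it was never the meta: zeros work fine here (hidden layers in an NN do need random init to break symmetry).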
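• Vectorize even across the samples of a batch? Yeah, no explicit for-loop over samples.
• reshape is usually a constant-time operation: NumPy returns a view instead of copying when it can.
• Don’t use rank-1 arrays in Python, i.e. shape (n,); use (n,1) or (1,n) instead.
• Use assert wisely to debug, e.g. to check shapes.
• MLE shows up even in the baby examples! Logistic regression is just MLE with a single layer: minimizing the cross-entropy loss maximizes the Bernoulli likelihood.
• X has shape (size_input, batch_size): one sample per column.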
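Several of the points above (column-stacked input, no rank-1 arrays, broadcasting, shape asserts, vectorizing over the batch, zero init) fit in one small sketch. This is a minimal illustration, not the course's assignment code: the function name `logistic_step`, the toy data, and the learning rate are assumptions made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, Y, lr=0.1):
    """One vectorized gradient-descent step (hypothetical helper).

    X: (size_input, batch_size), Y: (1, batch_size), w: (size_input, 1).
    """
    n, m = X.shape
    # Shape asserts catch rank-1 arrays such as (n,) early.
    assert w.shape == (n, 1) and Y.shape == (1, m)
    A = sigmoid(w.T @ X + b)              # (1, m); the scalar b is broadcast
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                            # (1, m), no loop over samples
    dw = X @ dZ.T / m                     # (n, 1)
    db = np.sum(dZ) / m
    return w - lr * dw, b - lr * db, cost

# Rank-1 pitfall: (3,) + (3, 1) silently broadcasts to a (3, 3) matrix.
v = np.arange(3)
col = v.reshape(3, 1)
surprise = v + col

# Toy data (assumption): label is 1 when the first two features sum to > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

w, b = np.zeros((3, 1)), 0.0              # zero init is enough here
costs = []
for _ in range(100):
    w, b, cost = logistic_step(w, b, X, Y)
    costs.append(cost)
```

With zero init every prediction starts at 0.5, so the first cost is $\log 2$, and it goes down from there.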

• Nomenclature:

### Week 3: Shallow NNs

• Hidden-layer activations have shape (hidden_units, batch_size).
• Avoid sigmoid in hidden layers; keep it for the output layer when the output is in 0-1 (binary classification).
• At the beginning there was structured data - Genesis Book
• $n^{[l]}$: number of neurons in layer $l$
• $W^{[l]}$: shape ($n^{[l]}$, $n^{[l-1]}$) = (next_layer, prev_layer)

• $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
• $A^{[l]} = g^{[l]}(Z^{[l]})$
• $A^{[l-1]}$ (the input of layer $l$): shape ($n^{[l-1]}$, batch_size) = (data_len, batch_size), stacked columns of data. The output $A^{[l]}$ then has shape ($n^{[l]}$, batch_size).
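The shape bookkeeping above is easy to check in code. The sketch below is only an illustration under assumptions: the helper names `init_params` and `forward`, the layer sizes, and the tanh/sigmoid choice are mine, not from the course.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """W{l} has shape (next_layer, prev_layer); b{l} has shape (next_layer, 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.normal(size=(layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params, n_layers):
    """Z = W A_prev + b, A = g(Z); samples are stacked as columns."""
    A = X
    for l in range(1, n_layers + 1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A + b                       # b is broadcast over the batch columns
        if l < n_layers:
            A = np.tanh(Z)                  # hidden layer
        else:
            A = 1.0 / (1.0 + np.exp(-Z))    # sigmoid only at the output
        assert A.shape == (W.shape[0], X.shape[1])
    return A

layer_dims = [4, 5, 1]                      # hypothetical sizes: input, hidden, output
X = np.random.default_rng(1).normal(size=(4, 8))   # (size_input, batch_size)
params = init_params(layer_dims)
Y_hat = forward(X, params, n_layers=2)
```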

### Week 4: Deep L-Layer Neural Network

• Early layers recognize simple information, the easiest to learn. The deeper you go from the first layer into later layers, the more complex the information extracted from the input.

• np.zeros needs the shape as a tuple, e.g. np.zeros((n, 1)).

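• Once you get the dims it’s easy; what follows is for a single layer.
• $Z \leftrightarrow A$ is a one-to-one, element-wise relation (each $A_{i,j}$ depends only on $Z_{i,j}$), so the chain rule is an element-wise product: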

• $dZ^{[l]} = \frac{\partial{J}}{\partial{Z^{[l]}}} = \frac{\partial{J}}{\partial{A^{[l]}}} * \frac{\partial{A^{[l]}}} {\partial{Z^{[l]}}}$

• $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$

• We can continue with dA or dZ
• $dA^{[l-1]}_{i,j} = \frac{\partial J}{\partial A^{[l-1]}_{i,j}} = \frac{\partial J}{\partial Z^{[l]}_{k,m}} \frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}}$ (repeated indices are summed)
• $dA^{[l-1]}_{i,j} = dZ^{[l]}_{k,m} \frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}}$

• $Z^{[l]}_{k,m} = W^{[l]}_{k,p} A^{[l-1]}_{p,m} + \dots$

• $\frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}} = W^{[l]}_{k,p}\delta_{p,i}\delta_{m,j}$

• $\frac{\partial Z^{[l]}_{k,m}}{\partial A^{[l-1]}_{i,j}} = W^{[l]}_{k,i}\delta_{m,j}$

• $dA^{[l-1]}_{i,j} = dZ^{[l]}_{k,m} W^{[l]}_{k,i}\delta_{m,j} = dZ^{[l]}_{k,j} W^{[l]}_{k,i}$

• $dA^{[l-1]}_{i,j} = W^{[l]T}_{i,k} dZ^{[l]}_{k,j}$

Finally: $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
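The index gymnastics can be double-checked numerically. A small sketch, with made-up sizes and random arrays standing in for $W^{[l]}$, $A^{[l-1]}$ and $dZ^{[l]}$: it compares the summed-index form (via `np.einsum`), the matrix form, and a finite-difference gradient of a toy scalar built so that its gradient with respect to $Z$ is exactly the chosen `dZ`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_cur, m = 4, 3, 5                 # hypothetical toy sizes
W = rng.normal(size=(n_cur, n_prev))       # W^{[l]}
b = rng.normal(size=(n_cur, 1))            # b^{[l]}
A_prev = rng.normal(size=(n_prev, m))      # A^{[l-1]}
dZ = rng.normal(size=(n_cur, m))           # stand-in for the upstream dZ^{[l]}

def J(A):
    # Toy scalar whose gradient w.r.t. Z = W A + b is exactly dZ.
    return np.sum(dZ * (W @ A + b))

# Index form: dA_prev[i, j] = sum_k dZ[k, j] * W[k, i]
dA_index = np.einsum("kj,ki->ij", dZ, W)
# Matrix form: dA_prev = W^T dZ
dA_matrix = W.T @ dZ

# Central finite differences, one entry of A_prev at a time.
eps = 1e-6
dA_numeric = np.zeros_like(A_prev)
for i in range(n_prev):
    for j in range(m):
        E = np.zeros_like(A_prev)
        E[i, j] = eps
        dA_numeric[i, j] = (J(A_prev + E) - J(A_prev - E)) / (2 * eps)

index_matches = np.allclose(dA_index, dA_matrix)
numeric_matches = np.allclose(dA_numeric, dA_matrix, atol=1e-6)
```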

• $dW^{[l]}_{i,j} = \frac{\partial J}{\partial W^{[l]}_{i,j}} = \frac{\partial J}{\partial Z^{[l]}_{k,m}} \frac{\partial Z^{[l]}_{k,m}}{\partial W^{[l]}_{i,j}}$

• $dW^{[l]}_{i,j} = dZ^{[l]}_{k,m} \frac{\partial Z^{[l]}_{k,m}}{\partial W^{[l]}_{i,j}}$

As we have seen before, differentiating with respect to one matrix of the product brings in the transpose of the other one.

• $Z^{[l]}_{k,m} = W^{[l]}_{k,p} A^{[l-1]}_{p,m} + \dots$

• $dW^{[l]}_{i,j} = dZ^{[l]}_{k,m} A^{[l-1]}_{p,m} \delta_{k,i} \delta_{p,j} = dZ^{[l]}_{i,m} A^{[l-1]}_{j,m}$

• $dW^{[l]}_{i,j} = dZ^{[l]}_{i,m} A^{[l-1]T}_{m,j}$

Finally, $dW^{[l]} = dZ^{[l]} A^{[l-1]T}$
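The same kind of sanity check works for $dW$. Again a sketch under assumptions (toy sizes, random stand-ins, and a toy scalar whose gradient with respect to $Z$ is the chosen `dZ`); there is no $1/m$ here because the toy scalar is a plain sum, not a mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_cur, m = 4, 3, 5                 # hypothetical toy sizes
W = rng.normal(size=(n_cur, n_prev))       # W^{[l]}
b = rng.normal(size=(n_cur, 1))            # b^{[l]}
A_prev = rng.normal(size=(n_prev, m))      # A^{[l-1]}
dZ = rng.normal(size=(n_cur, m))           # stand-in for the upstream dZ^{[l]}

def J(W_):
    # Toy scalar whose gradient w.r.t. Z = W A_prev + b is exactly dZ.
    return np.sum(dZ * (W_ @ A_prev + b))

# Index form: dW[i, j] = sum_m dZ[i, m] * A_prev[j, m]
dW_index = np.einsum("im,jm->ij", dZ, A_prev)
# Matrix form: dW = dZ A_prev^T
dW_matrix = dZ @ A_prev.T

# Central finite differences, one entry of W at a time.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(n_cur):
    for j in range(n_prev):
        E = np.zeros_like(W)
        E[i, j] = eps
        dW_numeric[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

index_matches = np.allclose(dW_index, dW_matrix)
numeric_matches = np.allclose(dW_numeric, dW_matrix, atol=1e-6)
```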

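• This is all without the batch mean, as in stochastic gradient descent (one sample per update); averaging the cost over a batch of $m$ samples adds a $1/m$ factor.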

### Wrapping up:

At the output layer:

• $dA^{[L]} \sim d\hat{Y} = \frac{\partial J}{\partial \hat{Y}}$: gradient given the output and its target, or whatever the cost function uses to optimize. The cost is a mean over the mini-batch samples ($m$); in the course convention $d\hat{Y}$ is taken per sample and the $1/m$ is applied later, in $dW$.

• $dZ^{[L]} = d\hat{Y} * g'^{[L]}(Z^{[L]})$

• $dW^{[L]} = dZ^{[L]} A^{[L-1]T}/m$: every sample provides a weights update and we take the mean. The $1/m$ comes from $\frac{\partial J}{\partial W}$: the cost is the mean of the per-sample losses, $J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}^{(i)}$, and since $dZ$ is computed per sample (without the $1/m$), the factor is pulled out of the sum over the batch.

At the $l$ layer:

• $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$

• $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$

• $dW^{[l]} = dZ^{[l]} A^{[l-1]T}/m$, remember that in this layer we use $A^{[l-1]}$ as input and $A^{[l]}$ as output.

It’s all about the optimization method: (mini-)batch gradient descent averages the per-sample gradients over the $m$ samples, and that’s where the $m$ comes from.
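To convince myself about the $1/m$, the wrap-up formulas can be compared against a numerical gradient of a cost defined as a mean over the batch. This is a sketch under assumptions, not the course's code: a two-layer net (tanh hidden layer, sigmoid output, cross-entropy cost) with made-up sizes, and hypothetical helper names `forward`, `cost`, `backward`, `numerical_grad`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.tanh(Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

def cost(params, X, Y):
    # J = mean over the m samples of the per-sample cross-entropy loss.
    A2 = forward(params, X)[-1]
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

def backward(params, X, Y):
    m = X.shape[1]
    Z1, A1, Z2, A2 = forward(params, X)
    dA2 = -(Y / A2) + (1 - Y) / (1 - A2)      # per-sample dL/dA, no 1/m yet
    dZ2 = dA2 * A2 * (1 - A2)                 # dA * g'(Z) for sigmoid
    dW2 = dZ2 @ A1.T / m                      # the 1/m of the mean appears here
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dA1 = params["W2"].T @ dZ2
    dZ1 = dA1 * (1 - A1 ** 2)                 # dA * g'(Z) for tanh
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def numerical_grad(params, X, Y, key, eps=1e-6):
    # Central differences on every entry of params[key].
    grad = np.zeros_like(params[key])
    for idx in np.ndindex(params[key].shape):
        old = params[key][idx]
        params[key][idx] = old + eps
        plus = cost(params, X, Y)
        params[key][idx] = old - eps
        minus = cost(params, X, Y)
        params[key][idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))                           # (size_input, batch_size)
Y = rng.integers(0, 2, size=(1, 6)).astype(float)
params = {
    "W1": rng.normal(size=(4, 3)) * 0.5, "b1": np.zeros((4, 1)),
    "W2": rng.normal(size=(1, 4)) * 0.5, "b2": np.zeros((1, 1)),
}
grads = backward(params, X, Y)
max_err = max(
    np.max(np.abs(grads[k] - numerical_grad(params, X, Y, k))) for k in grads
)
```

If the `/ m` is dropped from `backward`, the analytic and numerical gradients stop matching, which is a quick way to see that the factor belongs to the mean in the cost.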

Date: July 6, 2022
Tags: