Gradient of Loss wrt Inputs

2025-01-01 23:51:37 +00:00 · d4f2566363
commit d4f2566363
parent 14fa6a99f3
4 changed files with 544 additions and 5 deletions
--- a/lecture13_17/notes_13.ipynb
+++ b/lecture13_17/notes_13.ipynb
@@ -537,7 +537,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
@@ -630,11 +630,24 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
   "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "# Gradients of the Loss with Respect to Inputs\n",
+    "When chaining multiple layers together, we need the partial derivatives of the loss with respect to the next layer's input (i.e., the output of the current layer). This involves an extra summation: each input of a layer is fed into every neuron of that layer, so its gradient is the sum of the contributions coming back through all of those neurons.\n",
+    "\n",
+    "The gradient of the loss with respect to the $n$-th input, which is fed into all $i$ neurons, is\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta x_n} = \\frac{\\delta l}{\\delta z_1} \\frac{\\delta z_1}{\\delta x_n} + \\frac{\\delta l}{\\delta z_2} \\frac{\\delta z_2}{\\delta x_n} + ... + \\frac{\\delta l}{\\delta z_i} \\frac{\\delta z_i}{\\delta x_n}$\n",
+    "\n",
+    "\n",
+    "Noting that $\\frac{\\delta z_i}{\\delta x_n} = w_{in}$ allows us to write\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta \\vec{X}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta x_1} & \\frac{\\delta l}{\\delta x_2} & \\cdots & \\frac{\\delta l}{\\delta x_n} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in} \\end{bmatrix} = \\frac{\\delta l}{\\delta \\vec{Z}} \\overline{\\overline{W}}$\n",
+    "\n",
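+    "## Note on the Layer_Dense Class\n",
+    "The Layer_Dense class stores its weights already transposed, with shape (n_inputs, n_neurons), so that the forward pass can be written as np.dot(inputs, weights). For backpropagation the stored matrix must therefore be transposed back: the gradient with respect to the inputs is the gradient with respect to the layer outputs multiplied by the transpose of the stored weights."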
+   ]
  }
 ],
 "metadata": {
--- a/lecture13_17/notes_13.pdf
+++ b/lecture13_17/notes_13.pdf
--- a/lecture13_17/notes_13.py
+++ b/lecture13_17/notes_13.py
@@ -0,0 +1,346 @@
+# %% [markdown]
+# # Previous Class Definitions
+# The previously defined Layer_Dense, Activation_ReLU, Activation_Softmax, Loss, and Loss_CategoricalCrossEntropy classes.
+
+# %%
+# imports
+import matplotlib.pyplot as plt
+import numpy as np
+import nnfs
+from nnfs.datasets import spiral_data, vertical_data
+nnfs.init()
+
+# %%
+class Layer_Dense:
+    def __init__(self, n_inputs, n_neurons):
+        # Initialize the weights and biases
+        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)  # Normal distribution of weights
+        self.biases = np.zeros((1, n_neurons))
+
+    def forward(self, inputs):
+        # Calculate the output values from inputs, weights, and biases
+        self.output = np.dot(inputs, self.weights) + self.biases        # Weights are already transposed
+
+class Activation_ReLU:
+    def forward(self, inputs):
+        self.output = np.maximum(0, inputs)
+        
+class Activation_Softmax:
+    def forward(self, inputs):
+        # Get the unnormalized probabilities
+        # Subtract the row max before exponentiating to prevent overflow in np.exp
+        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
+
+        # Normalize the probabilities with element wise division
+        probabilities = exp_values / np.sum(exp_values, axis=1,keepdims=True)
+        self.output = probabilities
+
+# Base class for Loss functions
+class Loss:
+    '''Calculates the data and regularization losses given
+    model output and ground truth values'''
+    def calculate(self, output, y):
+        sample_losses = self.forward(output, y)
+        data_loss = np.average(sample_losses)
+        return data_loss
+
+class Loss_CategoricalCrossEntropy(Loss):
+    def forward(self, y_pred, y_true):
+        '''y_pred is the neural network output
+        y_true is the ideal output of the neural network'''
+        samples = len(y_pred)
+        # Bound the predicted values 
+        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)
+        
+        if len(y_true.shape) == 1:     # Categorically labeled
+            correct_confidences = y_pred_clipped[range(samples), y_true]
+        elif len(y_true.shape) == 2:   # One hot encoded
+            correct_confidences = np.sum(y_pred_clipped*y_true, axis=1)
+
+        # Calculate the losses
+        negative_log_likelihoods = -np.log(correct_confidences)
+        return negative_log_likelihoods
+
+# %% [markdown]
+# # Backpropagation of a Single Neuron
+# Backpropagation finds the gradient of the loss with respect to each parameter (weights and biases) of each neuron.
+# 
+# Imagine a layer that has 3 inputs and 1 neuron. There are three inputs (x0, x1, x2), three weights (w0, w1, w2), one bias (b0), and one output (z). The neuron output passes through a ReLU activation and then into a squared loss function (loss = ReLU(z)^2).
+# 
+# Loss = (ReLU(sum(mul(x0, w0), mul(x1, w1), mul(x2, w2), b0)))^2
+# 
+# $\frac{\delta Loss()}{\delta w0} = \frac{\delta Loss()}{\delta ReLU()} * \frac{\delta ReLU()}{\delta sum()} * \frac{\delta sum()}{\delta mul(x0, w0)} * \frac{\delta mul(x0, w0)}{\delta w0}$
+# 
+# $\frac{\delta Loss()}{\delta ReLU()} = 2 * ReLU(sum(...))$
+# 
+# $\frac{\delta ReLU()}{\delta sum()}$ = 0 if sum(...) is less than 0 and 1 if sum(...) is greater than 0 (the code below uses 0 at exactly 0)
+# 
+# $\frac{\delta sum()}{\delta mul(x0, w0)} = 1$
+# 
+# $\frac{\delta mul(x0, w0)}{\delta w0} = x0$
+# 
+# The same chain is repeated for w1, w2, and b0 (for the bias, the last factor is $\frac{\delta sum()}{\delta b0} = 1$).
+# 
+# These analytic derivatives give the gradient exactly, so no numerical differentiation is needed. We then update each parameter using a small step size, such that $w0[i+1] = w0[i] - step*\frac{\delta Loss()}{\delta w0}$
+# 
+
+# %%
+import numpy as np
+
+# Initial parameters
+weights = np.array([-3.0, -1.0, 2.0])
+bias = 1.0
+inputs = np.array([1.0, -2.0, 3.0])
+target_output = 0.0
+learning_rate = 0.001
+
+def relu(x):
+    return np.maximum(0, x)
+
+def relu_derivative(x):
+    return np.where(x > 0, 1.0, 0.0)
+
+for iteration in range(200):
+    # Forward pass
+    linear_output = np.dot(weights, inputs) + bias
+    output = relu(linear_output)
+    loss = (output - target_output) ** 2
+
+    # Backward pass to calculate gradient
+    dloss_doutput = 2 * (output - target_output)
+    doutput_dlinear = relu_derivative(linear_output)
+    dlinear_dweights = inputs
+    dlinear_dbias = 1.0
+
+    dloss_dlinear = dloss_doutput * doutput_dlinear
+    dloss_dweights = dloss_dlinear * dlinear_dweights
+    dloss_dbias = dloss_dlinear * dlinear_dbias
+
+    # Update weights and bias
+    weights -= learning_rate * dloss_dweights
+    bias -= learning_rate * dloss_dbias
+
+    # Print the loss for this iteration
+    print(f"Iteration {iteration + 1}, Loss: {loss}")
+
+print("Final weights:", weights)
+print("Final bias:", bias)
+
+
+# %% [markdown]
+# # Backpropagation of a Layer
+# The same procedure as for a single neuron, but now using matrices to keep track of every neuron in the layer.
+# 
+# If there are multiple input arrays (a batch of samples), the total loss can be taken as the sum of the per-sample losses. The gradient of the total loss with respect to a weight or bias is then the sum of the per-sample gradients, each evaluated with that sample's input.
+# 
+# In the example below, the loss is the square of the sum of the ReLU outputs, so the partial derivative of the loss with respect to the neuron output is the same for every active neuron (output greater than 0). As a result, the weight gradient matrix repeats the same vector, proportional to the inputs, for each of the N neurons, and the bias gradient is a single row of N elements with the same value. This does not hold for a general loss.
+
+# %%
+import numpy as np
+
+# Initial inputs
+inputs = np.array([1, 2, 3, 4])
+
+# Initial weights and biases
+weights = np.array([
+    [0.1, 0.2, 0.3, 0.4],
+    [0.5, 0.6, 0.7, 0.8],
+    [0.9, 1.0, 1.1, 1.2]
+])
+
+biases = np.array([0.1, 0.2, 0.3])
+
+learning_rate = 0.001
+
+# Add the derivative function to the ReLU class
+class Activation_ReLU:
+    def forward(self, inputs):
+        return np.maximum(0, inputs)
+    
+    def derivative(self, inputs):
+        return np.where(inputs > 0, 1, 0)
+    
+relu = Activation_ReLU()
+
+num_iterations = 200
+
+# Training loop
+# A single layer of 3 neurons, each with 4 inputs
+# The neuron layer is then fed into a ReLU activation layer
+for iteration in range(num_iterations):
+    # Forward pass
+    neuron_outputs = np.dot(weights, inputs) + biases
+    relu_outputs = relu.forward(neuron_outputs)
+    
+    # Calculate the squared loss assuming the desired output is a sum of 0. Trivial but just an example
+    final_output = np.sum(relu_outputs)
+    loss = final_output**2
+
+    # Backward pass
+    dL_dfinal_output = 2 * final_output
+    dfinal_output_drelu_output = np.ones_like(relu_outputs)
+    drelu_output_dneuron_output = relu.derivative(neuron_outputs)
+
+    dL_dneuron_output = dL_dfinal_output * dfinal_output_drelu_output * drelu_output_dneuron_output
+
+    # Get the gradient of the Loss with respect to the weights and biases
+    # dL_dW = np.outer(dL_dneuron_output, inputs)
+    dL_dW = inputs.reshape(-1, 1) @ dL_dneuron_output.reshape(1, -1)
+    dL_db = dL_dneuron_output
+
+    # Update the weights and biases
+    # Remove the .T if using dL_dW = np.outer(dL_dneuron_output, inputs)
+    weights -= learning_rate * dL_dW.T
+    biases -= learning_rate * dL_db
+
+    # Print the loss every 20 iterations
+    if iteration % 20 == 0:
+        print(f"Iteration {iteration}, Loss: {loss}")
+
+# Final weights and biases
+print("Final weights:\n", weights)
+print("Final biases:\n", biases)
+
+
+# %%
+
+
+# %% [markdown]
+# ## Change of Notation
+# The previous notation is clunky and long. From here forward, we will use the following notation for a layer with $n$ inputs and $i$ neurons. The neuron layer is followed by an activation layer, whose output is then fed into a final value $y$ with a computed loss $l$. The data may come in several batches, indexed by $j$.
+# 
+# $\vec{X_j} = \begin{bmatrix} x_{1j} & x_{2j} & \cdots & x_{nj} \end{bmatrix}$ -> Row vector for the layer inputs for the $j$-th batch of data.
+# 
+# $\overline{\overline{W}} = \begin{bmatrix} \vec{w_{1}} \\ \vec{w_{2}} \\ \vdots \\ \vec{w_{i}} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{i1} & w_{i2} & \cdots & w_{in}\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.
+# 
+# $\vec{B} = \begin{bmatrix} b_1 & b_2 & \cdots & b_i \end{bmatrix}$ -> Row vector for the neuron biases
+# 
+# $\vec{Z_j} = \begin{bmatrix} z_{1j} & z_{2j} & \cdots & z_{ij} \end{bmatrix}$ -> Row vector for the neuron outputs for the $j$-th batch of data.
+# 
+# $\vec{A_j} = \begin{bmatrix} a_{1j} & a_{2j} & \cdots & a_{ij} \end{bmatrix}$ -> Row vector for the activation layer outputs for the $j$-th batch of data.
+# 
+# $y_j$ -> Final layer output for the $j$-th batch of data if the layer is the final layer (could be summation, probability, etc).
+# 
+# $l_j$ -> Loss for the $j$-th batch of data.
+# 
+# The $j$ is often dropped because we typically only need to think with 1 set of input data.
+# 
+# ### Gradient Descent Using New Notation
+# We will look at the weight that the $i$-th neuron applies to the $n$-th input.
+# 
+# $\frac{\delta l}{\delta w_{in}} = \frac{\delta l}{\delta y} \frac{\delta y}{\delta a_i} \frac{\delta a_i}{\delta z_i} \frac{\delta z_i}{\delta w_{in}}$
+# 
+# Similarly, for the bias of the $i$-th neuron, there is
+# 
+# $\frac{\delta l}{\delta b_{i}} = \frac{\delta l}{\delta y} \frac{\delta y}{\delta a_i} \frac{\delta a_i}{\delta z_i} \frac{\delta z_i}{\delta b_{i}}$
+# 
+# For the system we are using, where $l = (y-0)^2$ and the activation layer is ReLU, we have
+# 
+# $\frac{\delta l}{\delta y} = 2y$
+# 
+# $\frac{\delta y}{\delta a_i} = 1$
+# 
+# $\frac{\delta a_i}{\delta z_i} = 1$ if $z_i > 0$ else $0$
+# 
+# $\frac{\delta z_i}{\delta w_{in}} = x_n$
+# 
+# $\frac{\delta z_i}{\delta b_{i}} = 1$
+# 
+# ### Matrix Representation of Gradient Descent
+# We can simplify by seeing that $\frac{\delta l}{\delta y} \frac{\delta y}{\delta a_i} \frac{\delta a_i}{\delta z_i} = \frac{\delta l}{\delta z_i}$ is a common term.
+# 
+# We collect the $\frac{\delta l}{\delta z_i}$ terms into a 1 x $i$ row vector such that
+# 
+# $\frac{\delta l}{\delta \vec{Z}} = \begin{bmatrix} \frac{\delta l}{\delta z_1} & \frac{\delta l}{\delta z_2} & \cdots & \frac{\delta l}{\delta z_i} \end{bmatrix}$
+# 
+# The gradient matrix for all weights is then an $i$ x $n$ matrix. Since $\frac{\delta z_i}{\delta w_{in}} = x_n$ for every neuron, it is the outer product of the column of $\frac{\delta l}{\delta z_i}$ values and the input row vector:
+# 
+# $\frac{\delta l}{\delta \overline{\overline{W}}} = \begin{bmatrix} \frac{\delta l}{\delta w_{11}} & \frac{\delta l}{\delta w_{12}} & \cdots & \frac{\delta l}{\delta w_{1n}} \\ \frac{\delta l}{\delta w_{21}} & \frac{\delta l}{\delta w_{22}} & \cdots & \frac{\delta l}{\delta w_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\delta l}{\delta w_{i1}} & \frac{\delta l}{\delta w_{i2}} & \cdots & \frac{\delta l}{\delta w_{in}} \end{bmatrix} = \begin{bmatrix} \frac{\delta l}{\delta z_1} \\ \frac{\delta l}{\delta z_2} \\ \vdots \\ \frac{\delta l}{\delta z_i} \end{bmatrix} \begin{bmatrix} \frac{\delta z_i}{\delta w_{i1}} & \frac{\delta z_i}{\delta w_{i2}} & \cdots & \frac{\delta z_i}{\delta w_{in}} \end{bmatrix} = \begin{bmatrix} \frac{\delta l}{\delta z_1} \\ \frac{\delta l}{\delta z_2} \\ \vdots \\ \frac{\delta l}{\delta z_i} \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$
+# 
+# Similarly, because $\frac{\delta z_i}{\delta b_i} = 1$, the gradient vector for the biases is given by
+# $\frac{\delta l}{\delta \vec{B}} = \frac{\delta l}{\delta \vec{Z}} \frac{\delta \vec{Z}}{\delta \vec{B}} = \begin{bmatrix} \frac{\delta l}{\delta z_1} & \frac{\delta l}{\delta z_2} & \cdots & \frac{\delta l}{\delta z_i} \end{bmatrix}$
+# 
+
+# %%
+# Code changed to match new notation
+import numpy as np
+
+# Initial inputs
+X = np.array([1, 2, 3, 4])
+
+# Initial weights and biases
+W = np.array([
+    [0.1, 0.2, 0.3, 0.4],
+    [0.5, 0.6, 0.7, 0.8],
+    [0.9, 1.0, 1.1, 1.2]
+])
+
+B = np.array([0.1, 0.2, 0.3])
+
+learning_rate = 0.001
+
+# Add the derivative function to the ReLU class
+class Activation_ReLU:
+    def forward(self, inputs):
+        return np.maximum(0, inputs)
+    
+    def derivative(self, inputs):
+        return np.where(inputs > 0, 1, 0)
+    
+relu = Activation_ReLU()
+
+num_iterations = 200
+
+# Training loop
+# A single layer of 3 neurons, each with 4 inputs
+# The neuron layer is then fed into a ReLU activation layer
+for iteration in range(num_iterations):
+    # Forward pass
+    Z = np.dot(W, X) + B
+    A = relu.forward(Z)
+    
+    # Calculate the squared loss assuming the desired output is a sum of 0. Trivial but just an example
+    y = np.sum(A)
+    l = y**2
+
+    # Backward pass
+    dL_dy = 2 * y
+    dy_dA = np.ones_like(A)
+    dA_dZ = relu.derivative(Z)
+
+    dl_dZ = dL_dy * dy_dA * dA_dZ
+
+    # Gradients w.r.t. weights (i x n outer product, same shape as W) and biases
+    dL_dW = np.outer(dl_dZ, X)
+    dL_dB = dl_dZ
+
+    # Update the weights and biases
+    W -= learning_rate * dL_dW
+    B -= learning_rate * dL_dB
+
+    # Print the loss every 20 iterations
+    if iteration % 20 == 0:
+        print(f"Iteration {iteration}, Loss: {l}")
+
+# Final weights and biases
+print("Final weights:\n", W)
+print("Final biases:\n", B)
+
+
+# %% [markdown]
+# # Gradients of the Loss with Respect to Inputs
+# When chaining multiple layers together, we need the partial derivatives of the loss with respect to the next layer's input (i.e., the output of the current layer). This involves an extra summation: each input of a layer is fed into every neuron of that layer, so its gradient is the sum of the contributions coming back through all of those neurons.
+# 
+# The gradient of the loss with respect to the $n$-th input, which is fed into all $i$ neurons, is
+# 
+# $\frac{\delta l}{\delta x_n} = \frac{\delta l}{\delta z_1} \frac{\delta z_1}{\delta x_n} + \frac{\delta l}{\delta z_2} \frac{\delta z_2}{\delta x_n} + ... + \frac{\delta l}{\delta z_i} \frac{\delta z_i}{\delta x_n}$
+# 
+# 
+# Noting that $\frac{\delta z_i}{\delta x_n} = w_{in}$ allows us to write
+# 
+# $\frac{\delta l}{\delta \vec{X}} = \begin{bmatrix} \frac{\delta l}{\delta x_1} & \frac{\delta l}{\delta x_2} & \cdots & \frac{\delta l}{\delta x_n} \end{bmatrix} = \begin{bmatrix} \frac{\delta l}{\delta z_1} & \frac{\delta l}{\delta z_2} & \cdots & \frac{\delta l}{\delta z_i} \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{i1} & w_{i2} & \cdots & w_{in} \end{bmatrix} = \frac{\delta l}{\delta \vec{Z}} \overline{\overline{W}}$
+# 
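+# ## Note on the Layer_Dense Class
+# The Layer_Dense class stores its weights already transposed, with shape (n_inputs, n_neurons), so that the forward pass can be written as np.dot(inputs, weights). For backpropagation the stored matrix must therefore be transposed back: the gradient with respect to the inputs is the gradient with respect to the layer outputs multiplied by the transpose of the stored weights.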
+
+
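Note on notes_13.py: the outer-product form of the weight gradient derived in the file can be checked against a finite-difference estimate. The following is a minimal sketch and not part of the commit. It reuses the toy layer from the notes (3 neurons, 4 inputs, ReLU, loss equal to the square of the summed outputs). The helper name `loss_fn` and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

# Same toy layer as in the notes: 3 neurons, 4 inputs
X = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])
B = np.array([0.1, 0.2, 0.3])

def loss_fn(W, B, X):
    # Hypothetical helper: forward pass and loss in one call
    Z = W @ X + B
    A = np.maximum(0, Z)
    return np.sum(A) ** 2

# Analytic gradients from the chain rule
Z = W @ X + B
y = np.sum(np.maximum(0, Z))
dl_dZ = 2 * y * (Z > 0).astype(float)
dl_dW = np.outer(dl_dZ, X)   # i x n, same shape as W
dl_dB = dl_dZ

# Central finite differences, perturbing one weight at a time
eps = 1e-6
num_dW = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        num_dW[r, c] = (loss_fn(W_plus, B, X) - loss_fn(W_minus, B, X)) / (2 * eps)

print("shapes match:", dl_dW.shape == W.shape)
print("analytic vs numeric agree:", np.allclose(dl_dW, num_dW, rtol=1e-5))
```

Because `np.outer(dl_dZ, X)` already has the same shape as `W`, no transpose is needed before the update step in this layout.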
--- a/lecture18/notes_18.ipynb
+++ b/lecture18/notes_18.ipynb
@@ -0,0 +1,180 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Previous Class Definitions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# imports\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import nnfs\n",
+    "from nnfs.datasets import spiral_data, vertical_data\n",
+    "nnfs.init()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Layer_Dense:\n",
+    "    def __init__(self, n_inputs, n_neurons):\n",
+    "        # Initialize the weights and biases\n",
+    "        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)  # Normal distribution of weights\n",
+    "        self.biases = np.zeros((1, n_neurons))\n",
+    "\n",
+    "    def forward(self, inputs):\n",
+    "        # Calculate the output values from inputs, weights, and biases\n",
+    "        self.output = np.dot(inputs, self.weights) + self.biases        # Weights are already transposed\n",
+    "\n",
+    "class Activation_ReLU:\n",
+    "    def forward(self, inputs):\n",
+    "        self.output = np.maximum(0, inputs)\n",
+    "    \n",
+    "    def derivative(self, inputs):\n",
+    "        return np.where(inputs > 0, 1, 0)\n",
+    "        \n",
+    "class Activation_Softmax:\n",
+    "    def forward(self, inputs):\n",
+    "        # Get the unnormalized probabilities\n",
+    "        # Subtract the row max before exponentiating to prevent overflow in np.exp\n",
+    "        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))\n",
+    "\n",
+    "        # Normalize the probabilities with element wise division\n",
+    "        probabilities = exp_values / np.sum(exp_values, axis=1,keepdims=True)\n",
+    "        self.output = probabilities\n",
+    "\n",
+    "# Base class for Loss functions\n",
+    "class Loss:\n",
+    "    '''Calculates the data and regularization losses given\n",
+    "    model output and ground truth values'''\n",
+    "    def calculate(self, output, y):\n",
+    "        sample_losses = self.forward(output, y)\n",
+    "        data_loss = np.average(sample_losses)\n",
+    "        return data_loss\n",
+    "\n",
+    "class Loss_CategoricalCrossEntropy(Loss):\n",
+    "    def forward(self, y_pred, y_true):\n",
+    "        '''y_pred is the neural network output\n",
+    "        y_true is the ideal output of the neural network'''\n",
+    "        samples = len(y_pred)\n",
+    "        # Bound the predicted values \n",
+    "        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)\n",
+    "        \n",
+    "        if len(y_true.shape) == 1:     # Categorically labeled\n",
+    "            correct_confidences = y_pred_clipped[range(samples), y_true]\n",
+    "        elif len(y_true.shape) == 2:   # One hot encoded\n",
+    "            correct_confidences = np.sum(y_pred_clipped*y_true, axis=1)\n",
+    "\n",
+    "        # Calculate the losses\n",
+    "        negative_log_likelihoods = -np.log(correct_confidences)\n",
+    "        return negative_log_likelihoods"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Previous Notes and Notation\n",
+    "The previous notation is clunky and long. From here forward, we will use the following notation for a layer with $n$ inputs and $i$ neurons. The neuron layer is followed by an activation layer, whose output is then fed into a final value $y$ with a computed loss $l$. The data may come in several batches, indexed by $j$.\n",
+    "\n",
+    "$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$-th batch of data.\n",
+    "\n",
+    "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n",
+    "\n",
+    "$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n",
+    "\n",
+    "$\\vec{Z_j} = \\begin{bmatrix} z_{1j} & z_{2j} & \\cdots & z_{ij} \\end{bmatrix}$ -> Row vector for the neuron outputs for the $j$-th batch of data.\n",
+    "\n",
+    "$\\vec{A_j} = \\begin{bmatrix} a_{1j} & a_{2j} & \\cdots & a_{ij} \\end{bmatrix}$ -> Row vector for the activation layer outputs for the $j$-th batch of data.\n",
+    "\n",
+    "$y_j$ -> Final layer output for the $j$-th batch of data if the layer is the final layer (could be summation, probability, etc).\n",
+    "\n",
+    "$l_j$ -> Loss for the $j$-th batch of data.\n",
+    "\n",
+    "The $j$ is often dropped because we typically only need to think with 1 set of input data.\n",
+    "\n",
+    "## Gradient Descent Using New Notation\n",
+    "We will look at the weight that the $i$-th neuron applies to the $n$-th input.\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
+    "\n",
+    "Similarly, for the bias of the $i$-th neuron, there is\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
+    "\n",
+    "For the system we are using, where $l = (y-0)^2$ and the activation layer is ReLU, we have\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta y} = 2y$\n",
+    "\n",
+    "$\\frac{\\delta y}{\\delta a_i} = 1$\n",
+    "\n",
+    "$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$ else $0$\n",
+    "\n",
+    "$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
+    "\n",
+    "$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
+    "\n",
+    "## Matrix Representation of Gradient Descent\n",
+    "We can simplify by seeing that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
+    "\n",
+    "We collect the $\\frac{\\delta l}{\\delta z_i}$ terms into a 1 x $i$ row vector such that\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
+    "\n",
+    "The gradient matrix for all weights is then an $i$ x $n$ matrix. Since $\\frac{\\delta z_i}{\\delta w_{in}} = x_n$ for every neuron, it is the outer product of the column of $\\frac{\\delta l}{\\delta z_i}$ values and the input row vector:\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & \\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_i}{\\delta w_{i1}} & \\frac{\\delta z_i}{\\delta w_{i2}} & \\cdots & \\frac{\\delta z_i}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
+    "\n",
+    "Similarly, because $\\frac{\\delta z_i}{\\delta b_i} = 1$, the gradient vector for the biases is given by\n",
+    "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} \\frac{\\delta \\vec{Z}}{\\delta \\vec{B}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
+    "\n",
+    "## Gradients of the Loss with Respect to Inputs\n",
+    "When chaining multiple layers together, we need the partial derivatives of the loss with respect to the next layer's input (i.e., the output of the current layer). This involves an extra summation: each input of a layer is fed into every neuron of that layer, so its gradient is the sum of the contributions coming back through all of those neurons.\n",
+    "\n",
+    "The gradient of the loss with respect to the $n$-th input, which is fed into all $i$ neurons, is\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta x_n} = \\frac{\\delta l}{\\delta z_1} \\frac{\\delta z_1}{\\delta x_n} + \\frac{\\delta l}{\\delta z_2} \\frac{\\delta z_2}{\\delta x_n} + ... + \\frac{\\delta l}{\\delta z_i} \\frac{\\delta z_i}{\\delta x_n}$\n",
+    "\n",
+    "\n",
+    "Noting that $\\frac{\\delta z_i}{\\delta x_n} = w_{in}$ allows us to write\n",
+    "\n",
+    "$\\frac{\\delta l}{\\delta \\vec{X}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta x_1} & \\frac{\\delta l}{\\delta x_2} & \\cdots & \\frac{\\delta l}{\\delta x_n} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in} \\end{bmatrix} = \\frac{\\delta l}{\\delta \\vec{Z}} \\overline{\\overline{W}}$\n",
+    "\n",
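+    "## Note on the Layer_Dense Class\n",
+    "The Layer_Dense class stores its weights already transposed, with shape (n_inputs, n_neurons), so that the forward pass can be written as np.dot(inputs, weights). For backpropagation the stored matrix must therefore be transposed back: the gradient with respect to the inputs is the gradient with respect to the layer outputs multiplied by the transpose of the stored weights."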
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
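
Note on the input-gradient section: with the Layer_Dense weight layout of shape (n_inputs, n_neurons), the formula for the gradient with respect to the inputs becomes a product with `weights.T`. The following is a minimal sketch and not part of the commit. The toy loss (square of the summed layer outputs, no activation), the helper `loss_fn`, the random seed, and the step `eps` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_neurons = 4, 3
# Layer_Dense convention: weights stored as (n_inputs, n_neurons)
weights = 0.01 * rng.standard_normal((n_inputs, n_neurons))
biases = np.zeros((1, n_neurons))
inputs = rng.standard_normal((1, n_inputs))   # one sample as a row vector

def loss_fn(inputs):
    # Hypothetical toy loss: square of the sum of the layer outputs
    z = inputs @ weights + biases
    return np.sum(z) ** 2

# Backward pass: gradient w.r.t. outputs, then w.r.t. inputs
z = inputs @ weights + biases
dl_dz = 2 * np.sum(z) * np.ones_like(z)       # shape (1, n_neurons)
dl_dinputs = dl_dz @ weights.T                # shape (1, n_inputs)

# Central finite differences, perturbing one input at a time
eps = 1e-6
numeric = np.zeros_like(inputs)
for k in range(n_inputs):
    step = np.zeros_like(inputs)
    step[0, k] = eps
    numeric[0, k] = (loss_fn(inputs + step) - loss_fn(inputs - step)) / (2 * eps)

print("gradient shape:", dl_dinputs.shape)
print("analytic vs numeric agree:", np.allclose(dl_dinputs, numeric, atol=1e-8))
```

The resulting gradient has the same shape as the layer input, which is what the previous layer needs as its own output gradient when layers are chained.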