From 14fa6a99f301e15830aca44e27e570c547d820d1 Mon Sep 17 00:00:00 2001 From: judsonupchurch Date: Wed, 1 Jan 2025 23:02:11 +0000 Subject: [PATCH] Matrix Representation of Gradient Descent Notes --- lecture13_17/handout_13.ipynb | 4 +- lecture13_17/notes_13.ipynb | 83 ++++++++++++++++++++++++++++------- 2 files changed, 68 insertions(+), 19 deletions(-) diff --git a/lecture13_17/handout_13.ipynb b/lecture13_17/handout_13.ipynb index 2a7cb7d..dd5fe0a 100644 --- a/lecture13_17/handout_13.ipynb +++ b/lecture13_17/handout_13.ipynb @@ -2016,7 +2016,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -2030,7 +2030,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.2" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/lecture13_17/notes_13.ipynb b/lecture13_17/notes_13.ipynb index 578e1d2..5422a26 100644 --- a/lecture13_17/notes_13.ipynb +++ b/lecture13_17/notes_13.ipynb @@ -469,6 +469,13 @@ "print(\"Final biases:\\n\", biases)\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, @@ -478,7 +485,7 @@ "\n", "$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$ batch of data.\n", "\n", - "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values.\n", + "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n", "\n", "$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n", "\n", @@ -488,12 +495,49 @@ "\n", "$y_j$ -> Final layer output for the $j$ batch of data if the layer is the final layer (could be summation, probability, etc).\n", "\n", - "$l_j$ -> Loss for the $j$ batch of data." 
+ "$l_j$ -> Loss for the $j$ batch of data.\n",
+ "\n",
+ "The subscript $j$ is often dropped because we typically only need to consider one set of input data at a time.\n",
+ "\n",
+ "### Gradient Descent Using New Notation\n",
+ "We will look at the weight that the $i$ neuron applies to the $n$ input.\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
+ "\n",
+ "Similarly, for the bias of the $i$ neuron, we have\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
+ "\n",
+ "For the system we are using, where $l = (y-0)^2$ and the activation function is ReLU, we have\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta y} = 2y$\n",
+ "\n",
+ "$\\frac{\\delta y}{\\delta a_i} = 1$\n",
+ "\n",
+ "$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$, else $0$\n",
+ "\n",
+ "$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
+ "\n",
+ "$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
+ "\n",
+ "### Matrix Representation of Gradient Descent\n",
+ "We can simplify by noting that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
+ "\n",
+ "We collect the $\\frac{\\delta l}{\\delta z_i}$ terms into a 1 x $i$ row vector such that\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
+ "\n",
+ "We can then see that the gradient matrix for all of the weights is an $i$ x $n$ matrix, given by the outer product of $\\frac{\\delta l}{\\delta \\vec{Z}}$ (written as a column) with $\\vec{X}$:\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & \\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_i}{\\delta w_{i1}} & \\frac{\\delta z_i}{\\delta w_{i2}} & \\cdots & \\frac{\\delta z_i}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
+ "\n",
+ "Similarly, since $\\frac{\\delta z_i}{\\delta b_{i}} = 1$, the gradient vector for the biases is given by\n",
+ "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n"
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "Iteration 0, Loss: 466.56000000000006\n",
- "Iteration 20, Loss: 5.329595763793193\n",
- "Iteration 40, Loss: 0.41191524253483786\n",
- "Iteration 60, Loss: 0.03183621475376345\n",
- "Iteration 80, Loss: 0.002460565405431671\n",
- "Iteration 100, Loss: 
0.0001901729121621426\n", - "Iteration 120, Loss: 1.4698120139337557e-05\n", - "Iteration 140, Loss: 1.1359948840900371e-06\n", - "Iteration 160, Loss: 8.779778427447647e-08\n", - "Iteration 180, Loss: 6.785903626216421e-09\n", + "Iteration 20, Loss: 5.32959636083938\n", + "Iteration 40, Loss: 0.41191523404899866\n", + "Iteration 60, Loss: 0.031836212079467595\n", + "Iteration 80, Loss: 0.002460565465389601\n", + "Iteration 100, Loss: 0.000190172825660145\n", + "Iteration 120, Loss: 1.4698126966451542e-05\n", + "Iteration 140, Loss: 1.1359926717815175e-06\n", + "Iteration 160, Loss: 8.779889800154524e-08\n", + "Iteration 180, Loss: 6.7858241357822796e-09\n", "Final weights:\n", - " [[-0.00698895 -0.0139779 -0.02096685 -0.0279558 ]\n", - " [ 0.25975286 0.11950571 -0.02074143 -0.16098857]\n", + " [[-0.00698895 -0.01397789 -0.02096684 -0.02795579]\n", + " [ 0.25975286 0.11950572 -0.02074143 -0.16098857]\n", " [ 0.53548461 0.27096922 0.00645383 -0.25806156]]\n", "Final biases:\n", " [-0.00698895 -0.04024714 -0.06451539]\n" @@ -569,12 +613,10 @@ " dl_dZ = dL_dy * dy_dA * dA_dZ\n", "\n", " # Get the gradient of the Loss with respect to the weights and biases\n", - " # dL_dW = np.outer(dl_dz, X)\n", - " dL_dW = X.reshape(-1, 1) @ dl_dZ.reshape(1, -1)\n", + " dL_dW = np.outer(X.T, dl_dZ)\n", " dL_dB = dl_dZ\n", "\n", " # Update the weights and biases\n", - " # Remove the .T if using dL_dW = np.outer(dl_dz, X)\n", " W -= learning_rate * dL_dW.T\n", " B -= learning_rate * dL_dB\n", "\n", @@ -586,6 +628,13 @@ "print(\"Final weights:\\n\", W)\n", "print(\"Final biases:\\n\", B)\n" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {