Matrix Representation of Gradient Descent Notes

This commit is contained in:
judsonupchurch 2025-01-01 23:02:11 +00:00
parent 23fca634fb
commit 14fa6a99f3
2 changed files with 68 additions and 19 deletions

View File

@ -2016,7 +2016,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@ -2030,7 +2030,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.10.12"
}
},
"nbformat": 4,

View File

@ -469,6 +469,13 @@
"print(\"Final biases:\\n\", biases)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@ -478,7 +485,7 @@
"\n",
"$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$ batch of data.\n",
"\n",
"$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values.\n",
"$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n",
"\n",
"$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n",
"\n",
@ -488,12 +495,49 @@
"\n",
"$y_j$ -> Final layer output for the $j$ batch of data if the layer is the final layer (could be summation, probability, etc).\n",
"\n",
"$l_j$ -> Loss for the $j$ batch of data."
"$l_j$ -> Loss for the $j$ batch of data.\n",
"\n",
    "The $j$ subscript is often dropped because we typically only need to consider one set of input data at a time.\n",
"\n",
"### Gradient Descent Using New Notation\n",
"We will look at the weight that the $i$ neuron applies for the $n$ input.\n",
"\n",
"$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
"\n",
"Similarly, for the bias of the $i$ neuron, there is\n",
"\n",
"$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
"\n",
"For the system we are using, where $l = (y-0)^2$ and the activation layer is ReLU, we have\n",
"\n",
"$\\frac{\\delta l}{\\delta y} = 2y$\n",
"\n",
"$\\frac{\\delta y}{\\delta a_i} = 1$\n",
"\n",
"$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$ else $0$\n",
"\n",
"$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
"\n",
"$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
"\n",
"### Matrix Representation of Gradient Descent\n",
"We can simplify by seeing that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
"\n",
    "We take $\\frac{\\delta l}{\\delta z_i}$ and turn it into a 1 x $i$ vector such that\n",
"\n",
"$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
"\n",
    "We then can get that the gradient matrix for all weights is an $i$ x $n$ matrix given by\n",
"\n",
    "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & \\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_i}{\\delta w_{i1}} & \\frac{\\delta z_i}{\\delta w_{i2}} & \\cdots & \\frac{\\delta z_i}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
"\n",
    "Similarly, since each $\\frac{\\delta z_i}{\\delta b_i} = 1$, the gradient vector for the biases is given by\n",
    "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} \\frac{\\delta \\vec{Z}}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n"
]
},
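The matrix form derived in the cell above can be sketched in NumPy. This is a minimal illustration with made-up shapes (3 neurons, 4 inputs) and random values, not the notebook's actual data: the outer product of $\frac{\delta l}{\delta \vec{Z}}$ (as a column) with $\vec{X}$ (as a row) reproduces every per-component gradient $\frac{\delta l}{\delta w_{in}} = \frac{\delta l}{\delta z_i} x_n$.

```python
import numpy as np

# Illustrative shapes: i = 3 neurons, n = 4 inputs (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(size=4)        # input row vector x_1..x_n
W = rng.normal(size=(3, 4))   # row i holds neuron i's weights
B = rng.normal(size=3)        # biases b_1..b_i

# Forward pass: z_i = sum_n w_in * x_n + b_i, ReLU activation, y = sum of a_i
Z = W @ X + B
A = np.maximum(Z, 0)
y = A.sum()

# Backward pass for l = (y - 0)^2 with dy/da_i = 1:
# dl/dz_i = dl/dy * dy/da_i * da_i/dz_i
dl_dZ = 2 * y * (Z > 0).astype(float)

# Matrix form: dl/dW is the outer product of dl/dZ (column) and X (row)
dl_dW = np.outer(dl_dZ, X)    # shape (i, n), matching W
dl_dB = dl_dZ                 # since dz_i/db_i = 1

# Check against the per-component chain rule dl/dw_in = dl/dz_i * x_n
for i in range(3):
    for n in range(4):
        assert np.isclose(dl_dW[i, n], dl_dZ[i] * X[n])
```

The outer product works here because each $z_i$ depends only on its own row of weights, so the gradient matrix factors cleanly into a column times a row.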
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@ -501,18 +545,18 @@
"output_type": "stream",
"text": [
"Iteration 0, Loss: 466.56000000000006\n",
"Iteration 20, Loss: 5.329595763793193\n",
"Iteration 40, Loss: 0.41191524253483786\n",
"Iteration 60, Loss: 0.03183621475376345\n",
"Iteration 80, Loss: 0.002460565405431671\n",
"Iteration 100, Loss: 0.0001901729121621426\n",
"Iteration 120, Loss: 1.4698120139337557e-05\n",
"Iteration 140, Loss: 1.1359948840900371e-06\n",
"Iteration 160, Loss: 8.779778427447647e-08\n",
"Iteration 180, Loss: 6.785903626216421e-09\n",
"Iteration 20, Loss: 5.32959636083938\n",
"Iteration 40, Loss: 0.41191523404899866\n",
"Iteration 60, Loss: 0.031836212079467595\n",
"Iteration 80, Loss: 0.002460565465389601\n",
"Iteration 100, Loss: 0.000190172825660145\n",
"Iteration 120, Loss: 1.4698126966451542e-05\n",
"Iteration 140, Loss: 1.1359926717815175e-06\n",
"Iteration 160, Loss: 8.779889800154524e-08\n",
"Iteration 180, Loss: 6.7858241357822796e-09\n",
"Final weights:\n",
" [[-0.00698895 -0.0139779 -0.02096685 -0.0279558 ]\n",
" [ 0.25975286 0.11950571 -0.02074143 -0.16098857]\n",
" [[-0.00698895 -0.01397789 -0.02096684 -0.02795579]\n",
" [ 0.25975286 0.11950572 -0.02074143 -0.16098857]\n",
" [ 0.53548461 0.27096922 0.00645383 -0.25806156]]\n",
"Final biases:\n",
" [-0.00698895 -0.04024714 -0.06451539]\n"
@ -569,12 +613,10 @@
" dl_dZ = dL_dy * dy_dA * dA_dZ\n",
"\n",
" # Get the gradient of the Loss with respect to the weights and biases\n",
" # dL_dW = np.outer(dl_dz, X)\n",
" dL_dW = X.reshape(-1, 1) @ dl_dZ.reshape(1, -1)\n",
" dL_dW = np.outer(X.T, dl_dZ)\n",
" dL_dB = dl_dZ\n",
"\n",
" # Update the weights and biases\n",
" # Remove the .T if using dL_dW = np.outer(dl_dz, X)\n",
" W -= learning_rate * dL_dW.T\n",
" B -= learning_rate * dL_dB\n",
"\n",
@ -586,6 +628,13 @@
"print(\"Final weights:\\n\", W)\n",
"print(\"Final biases:\\n\", B)\n"
]
},
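The comments in the loop above note that the transpose in the weight update depends on the argument order passed to `np.outer`. A quick sketch with made-up values (not the notebook's data) confirms the two formulations agree:

```python
import numpy as np

# Assumed illustrative values: 4 inputs, 3 neurons
X = np.array([1.0, 2.0, 3.0, 4.0])
dl_dZ = np.array([0.5, -1.0, 2.0])

# np.outer(X, dl_dZ) has shape (n, i); transposing recovers the (i, n) layout of W
a = np.outer(X, dl_dZ).T

# np.outer(dl_dZ, X) produces the (i, n) layout directly, so no transpose is needed
b = np.outer(dl_dZ, X)

assert np.allclose(a, b)
```

Either ordering is fine as long as the update applies the matching transpose, which is exactly what the `# Remove the .T if using ...` comment in the loop is warning about.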
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {