From 14fa6a99f301e15830aca44e27e570c547d820d1 Mon Sep 17 00:00:00 2001 From: judsonupchurch Date: Wed, 1 Jan 2025 23:02:11 +0000 Subject: [PATCH] Matrix Representation of Gradient Descent Notes --- lecture13_17/handout_13.ipynb | 4 +- lecture13_17/notes_13.ipynb | 83 ++++++++++++++++++++++++++++------- 2 files changed, 68 insertions(+), 19 deletions(-) diff --git a/lecture13_17/handout_13.ipynb b/lecture13_17/handout_13.ipynb index 2a7cb7d..dd5fe0a 100644 --- a/lecture13_17/handout_13.ipynb +++ b/lecture13_17/handout_13.ipynb @@ -2016,7 +2016,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -2030,7 +2030,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.2" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/lecture13_17/notes_13.ipynb b/lecture13_17/notes_13.ipynb index 578e1d2..5422a26 100644 --- a/lecture13_17/notes_13.ipynb +++ b/lecture13_17/notes_13.ipynb @@ -469,6 +469,13 @@ "print(\"Final biases:\\n\", biases)\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, @@ -478,7 +485,7 @@ "\n", "$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$ batch of data.\n", "\n", - "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values.\n", + "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n", "\n", "$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n", "\n", @@ -488,12 +495,49 @@ "\n", "$y_j$ -> Final layer output for the $j$ batch of data if the layer is the final layer (could be summation, probability, etc).\n", "\n", - "$l_j$ -> Loss for the $j$ batch of data." 
+ "$l_j$ -> Loss for the $j$ batch of data.\n",
+ "\n",
+ "The subscript $j$ is often dropped because we typically only need to consider one set of input data at a time.\n",
+ "\n",
+ "### Gradient Descent Using New Notation\n",
+ "We will look at the weight that the $i$ neuron applies to the $n$ input.\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
+ "\n",
+ "Similarly, for the bias of the $i$ neuron, we have\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
+ "\n",
+ "For the system we are using, where $l = (y-0)^2$ and the activation function is ReLU, we have\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta y} = 2y$\n",
+ "\n",
+ "$\\frac{\\delta y}{\\delta a_i} = 1$\n",
+ "\n",
+ "$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$, else $0$\n",
+ "\n",
+ "$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
+ "\n",
+ "$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
+ "\n",
+ "### Matrix Representation of Gradient Descent\n",
+ "We can simplify by noting that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
+ "\n",
+ "We collect the $\\frac{\\delta l}{\\delta z_i}$ terms into a 1 x $i$ row vector such that\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
+ "\n",
+ "We can then see that the gradient matrix for all of the weights is an $i$ x $n$ matrix, given by the outer product of $\\frac{\\delta l}{\\delta \\vec{Z}}$ (written as a column) with $\\vec{X}$:\n",
+ "\n",
+ "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & \\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_i}{\\delta w_{i1}} & \\frac{\\delta z_i}{\\delta w_{i2}} & \\cdots & \\frac{\\delta z_i}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
+ "\n",
+ "Similarly, since $\\frac{\\delta z_i}{\\delta b_{i}} = 1$, the gradient vector for the biases is given by\n",
+ "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n"
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "Iteration 0, Loss: 466.56000000000006\n",
- "Iteration 20, Loss: 5.329595763793193\n",
- "Iteration 40, Loss: 0.41191524253483786\n",
- "Iteration 60, Loss: 0.03183621475376345\n",
- "Iteration 80, Loss: 0.002460565405431671\n",
- "Iteration 100, Loss: 
0.0001901729121621426\n", - "Iteration 120, Loss: 1.4698120139337557e-05\n", - "Iteration 140, Loss: 1.1359948840900371e-06\n", - "Iteration 160, Loss: 8.779778427447647e-08\n", - "Iteration 180, Loss: 6.785903626216421e-09\n", + "Iteration 20, Loss: 5.32959636083938\n", + "Iteration 40, Loss: 0.41191523404899866\n", + "Iteration 60, Loss: 0.031836212079467595\n", + "Iteration 80, Loss: 0.002460565465389601\n", + "Iteration 100, Loss: 0.000190172825660145\n", + "Iteration 120, Loss: 1.4698126966451542e-05\n", + "Iteration 140, Loss: 1.1359926717815175e-06\n", + "Iteration 160, Loss: 8.779889800154524e-08\n", + "Iteration 180, Loss: 6.7858241357822796e-09\n", "Final weights:\n", - " [[-0.00698895 -0.0139779 -0.02096685 -0.0279558 ]\n", - " [ 0.25975286 0.11950571 -0.02074143 -0.16098857]\n", + " [[-0.00698895 -0.01397789 -0.02096684 -0.02795579]\n", + " [ 0.25975286 0.11950572 -0.02074143 -0.16098857]\n", " [ 0.53548461 0.27096922 0.00645383 -0.25806156]]\n", "Final biases:\n", " [-0.00698895 -0.04024714 -0.06451539]\n" @@ -569,12 +613,10 @@ " dl_dZ = dL_dy * dy_dA * dA_dZ\n", "\n", " # Get the gradient of the Loss with respect to the weights and biases\n", - " # dL_dW = np.outer(dl_dz, X)\n", - " dL_dW = X.reshape(-1, 1) @ dl_dZ.reshape(1, -1)\n", + " dL_dW = np.outer(X.T, dl_dZ)\n", " dL_dB = dl_dZ\n", "\n", " # Update the weights and biases\n", - " # Remove the .T if using dL_dW = np.outer(dl_dz, X)\n", " W -= learning_rate * dL_dW.T\n", " B -= learning_rate * dL_dB\n", "\n", @@ -586,6 +628,13 @@ "print(\"Final weights:\\n\", W)\n", "print(\"Final biases:\\n\", B)\n" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {