Matrix Representation of Gradient Descent Notes

This commit is contained in:
judsonupchurch 2025-01-01 23:02:11 +00:00
parent 23fca634fb
commit 14fa6a99f3
2 changed files with 68 additions and 19 deletions

View File

@ -2016,7 +2016,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@ -2030,7 +2030,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.10.12"
}
},
"nbformat": 4,

View File

@ -469,6 +469,13 @@
"print(\"Final biases:\\n\", biases)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@ -478,7 +485,7 @@
"\n",
"$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$ batch of data.\n",
"\n",
"$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values.\n",
"$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n",
"\n",
"$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n",
"\n",
@ -488,12 +495,49 @@
"\n",
"$y_j$ -> Final layer output for the $j$ batch of data if the layer is the final layer (could be summation, probability, etc).\n",
"\n",
"$l_j$ -> Loss for the $j$ batch of data."
"$l_j$ -> Loss for the $j$ batch of data.\n",
"\n",
    "The $j$ subscript is often dropped because we typically only need to consider one set of input data at a time.\n",
"\n",
"### Gradient Descent Using New Notation\n",
"We will look at the weight that the $i$ neuron applies for the $n$ input.\n",
"\n",
"$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
"\n",
"Similarly, for the bias of the $i$ neuron, there is\n",
"\n",
"$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
"\n",
"For the system we are using, where $l = (y-0)^2$ and the activation layer is ReLU, we have\n",
"\n",
"$\\frac{\\delta l}{\\delta y} = 2y$\n",
"\n",
"$\\frac{\\delta y}{\\delta a_i} = 1$\n",
"\n",
"$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$ else $0$\n",
"\n",
"$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
"\n",
"$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
"\n",
"### Matrix Representation of Gradient Descent\n",
"We can simplify by seeing that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
"\n",
    "We take $\\frac{\\delta l}{\\delta z_i}$ and turn it into a 1 x $i$ vector such that\n",
"\n",
"$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
"\n",
    "We then can get that the gradient matrix for all weights is an $i$ x $n$ matrix given by\n",
"\n",
    "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & \\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_i}{\\delta w_{i1}} & \\frac{\\delta z_i}{\\delta w_{i2}} & \\cdots & \\frac{\\delta z_i}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_i} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
"\n",
    "Similarly, since each $\\frac{\\delta z_i}{\\delta b_i} = 1$, the gradient vector for the biases is given by\n",
    "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} \\frac{\\delta \\vec{Z}}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n"
]
},
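The matrix form derived in the cell above can be sketched in NumPy. This is a minimal illustration with made-up shapes (3 neurons, 4 inputs) and random values, not the notebook's actual data: the outer product of $\frac{\delta l}{\delta \vec{Z}}$ (as a column) with $\vec{X}$ (as a row) reproduces every per-component gradient $\frac{\delta l}{\delta w_{in}} = \frac{\delta l}{\delta z_i} x_n$.

```python
import numpy as np

# Illustrative shapes: i = 3 neurons, n = 4 inputs (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(size=4)        # input row vector x_1..x_n
W = rng.normal(size=(3, 4))   # row i holds neuron i's weights
B = rng.normal(size=3)        # biases b_1..b_i

# Forward pass: z_i = sum_n w_in * x_n + b_i, ReLU activation, y = sum of a_i
Z = W @ X + B
A = np.maximum(Z, 0)
y = A.sum()

# Backward pass for l = (y - 0)^2 with dy/da_i = 1:
# dl/dz_i = dl/dy * dy/da_i * da_i/dz_i
dl_dZ = 2 * y * (Z > 0).astype(float)

# Matrix form: dl/dW is the outer product of dl/dZ (column) and X (row)
dl_dW = np.outer(dl_dZ, X)    # shape (i, n), matching W
dl_dB = dl_dZ                 # since dz_i/db_i = 1

# Check against the per-component chain rule dl/dw_in = dl/dz_i * x_n
for i in range(3):
    for n in range(4):
        assert np.isclose(dl_dW[i, n], dl_dZ[i] * X[n])
```

The outer product works here because each $z_i$ depends only on its own row of weights, so the gradient matrix factors cleanly into a column times a row.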
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@ -501,18 +545,18 @@
"output_type": "stream",
"text": [
"Iteration 0, Loss: 466.56000000000006\n",
"Iteration 20, Loss: 5.329595763793193\n",
"Iteration 40, Loss: 0.41191524253483786\n",
"Iteration 60, Loss: 0.03183621475376345\n",
"Iteration 80, Loss: 0.002460565405431671\n",
"Iteration 100, Loss: 0.0001901729121621426\n",
"Iteration 120, Loss: 1.4698120139337557e-05\n",
"Iteration 140, Loss: 1.1359948840900371e-06\n",
"Iteration 160, Loss: 8.779778427447647e-08\n",
"Iteration 180, Loss: 6.785903626216421e-09\n",
"Iteration 20, Loss: 5.32959636083938\n",
"Iteration 40, Loss: 0.41191523404899866\n",
"Iteration 60, Loss: 0.031836212079467595\n",
"Iteration 80, Loss: 0.002460565465389601\n",
"Iteration 100, Loss: 0.000190172825660145\n",
"Iteration 120, Loss: 1.4698126966451542e-05\n",
"Iteration 140, Loss: 1.1359926717815175e-06\n",
"Iteration 160, Loss: 8.779889800154524e-08\n",
"Iteration 180, Loss: 6.7858241357822796e-09\n",
"Final weights:\n",
" [[-0.00698895 -0.0139779 -0.02096685 -0.0279558 ]\n",
" [ 0.25975286 0.11950571 -0.02074143 -0.16098857]\n",
" [[-0.00698895 -0.01397789 -0.02096684 -0.02795579]\n",
" [ 0.25975286 0.11950572 -0.02074143 -0.16098857]\n",
" [ 0.53548461 0.27096922 0.00645383 -0.25806156]]\n",
"Final biases:\n",
" [-0.00698895 -0.04024714 -0.06451539]\n"
@ -569,12 +613,10 @@
" dl_dZ = dL_dy * dy_dA * dA_dZ\n",
"\n",
" # Get the gradient of the Loss with respect to the weights and biases\n",
" # dL_dW = np.outer(dl_dz, X)\n",
" dL_dW = X.reshape(-1, 1) @ dl_dZ.reshape(1, -1)\n",
" dL_dW = np.outer(X.T, dl_dZ)\n",
" dL_dB = dl_dZ\n",
"\n",
" # Update the weights and biases\n",
" # Remove the .T if using dL_dW = np.outer(dl_dz, X)\n",
" W -= learning_rate * dL_dW.T\n",
" B -= learning_rate * dL_dB\n",
"\n",
@ -586,6 +628,13 @@
"print(\"Final weights:\\n\", W)\n",
"print(\"Final biases:\\n\", B)\n"
]
},
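The comments in the loop above note that the transpose in the weight update depends on the argument order passed to `np.outer`. A quick sketch with made-up values (not the notebook's data) confirms the two formulations agree:

```python
import numpy as np

# Assumed illustrative values: 4 inputs, 3 neurons
X = np.array([1.0, 2.0, 3.0, 4.0])
dl_dZ = np.array([0.5, -1.0, 2.0])

# np.outer(X, dl_dZ) has shape (n, i); transposing recovers the (i, n) layout of W
a = np.outer(X, dl_dZ).T

# np.outer(dl_dZ, X) produces the (i, n) layout directly, so no transpose is needed
b = np.outer(dl_dZ, X)

assert np.allclose(a, b)
```

Either ordering is fine as long as the update applies the matching transpose, which is exactly what the `# Remove the .T if using ...` comment in the loop is warning about.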
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {