Neural-Networks-From-Scratch/lecture13_17/notes_13.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Previous Class Definitions\n",
    "The previously defined Layer_Dense, Activation_ReLU, Activation_Softmax, Loss, and Loss_CategoricalCrossEntropy classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import nnfs\n",
    "from nnfs.datasets import spiral_data, vertical_data\n",
    "nnfs.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Layer_Dense:\n",
    "    def __init__(self, n_inputs, n_neurons):\n",
    "        # Initialize the weights and biases\n",
    "        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)  # Normal distribution of weights\n",
    "        self.biases = np.zeros((1, n_neurons))\n",
    "\n",
    "    def forward(self, inputs):\n",
    "        # Calculate the output values from inputs, weights, and biases\n",
    "        self.output = np.dot(inputs, self.weights) + self.biases        # Weights are already transposed\n",
    "\n",
    "class Activation_ReLU:\n",
    "    def forward(self, inputs):\n",
    "        self.output = np.maximum(0, inputs)\n",
    "        \n",
    "class Activation_Softmax:\n",
    "    def forward(self, inputs):\n",
    "        # Get the unnormalized probabilities\n",
    "        # Subtract max from the row to prevent larger numbers\n",
    "        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))\n",
    "\n",
    "        # Normalize the probabilities with element wise division\n",
    "        probabilities = exp_values / np.sum(exp_values, axis=1,keepdims=True)\n",
    "        self.output = probabilities\n",
    "\n",
    "# Base class for Loss functions\n",
    "class Loss:\n",
    "    '''Calculates the data and regularization losses given\n",
    "    model output and ground truth values'''\n",
    "    def calculate(self, output, y):\n",
    "        sample_losses = self.forward(output, y)\n",
    "        data_loss = np.average(sample_losses)\n",
    "        return data_loss\n",
    "\n",
    "class Loss_CategoricalCrossEntropy(Loss):\n",
    "    def forward(self, y_pred, y_true):\n",
    "        '''y_pred is the neural network output\n",
    "        y_true is the ideal output of the neural network'''\n",
    "        samples = len(y_pred)\n",
    "        # Bound the predicted values \n",
    "        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)\n",
    "        \n",
    "        if len(y_true.shape) == 1:     # Categorically labeled\n",
    "            correct_confidences = y_pred_clipped[range(samples), y_true]\n",
    "        elif len(y_true.shape) == 2:   # One hot encoded\n",
    "            correct_confidences = np.sum(y_pred_clipped*y_true, axis=1)\n",
    "\n",
    "        # Calculate the losses\n",
    "        negative_log_likelihoods = -np.log(correct_confidences)\n",
    "        return negative_log_likelihoods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Backpropagation of a Single Neuron\n",
    "Backpropagation helps us find the gradient of the neural network with respect to each of the parameters (weights and biases) of each neuron.\n",
    "\n",
    "Imagine a layer that has 3 inputs and 1 neuron. There are 3 inputs (x0, x1, x2), three weights (w0, w1, w2), 1 bias (b0), and 1 output (z). There is a ReLU activation layer after the neuron output going into a square loss function (loss = z^2).\n",
    "\n",
    "Loss = (ReLU(sum(mul(x0, w0), mul(x1, w1), mul(x2, w2(, b0)))))^2\n",
    "\n",
    "$\\frac{\\delta Loss()}{\\delta w0} = \\frac{\\delta Loss()}{\\delta ReLU()} * \\frac{\\delta ReLU()}{\\delta sum()} * \\frac{\\delta sum()}{\\delta mul(x0, w0)} * \\frac{\\delta mul(x0, w0)}{\\delta w0}$\n",
    "\n",
    "$\\frac{\\delta Loss()}{\\delta ReLU()} = 2 * ReLU(sum(...))$\n",
    "\n",
    "$\\frac{\\delta ReLU()}{\\delta sum()}$ = 0 if sum(...) is less than 0 and 1 if sum(...) is greater than 0\n",
    "\n",
    "$\\frac{\\delta sum()}{\\delta mul(x0, w0)} = 1$\n",
    "\n",
    "$\\frac{\\delta mul(x0, w0)}{\\delta w0} = x0$\n",
    "\n",
    "This is repeated for w0, w1, w2, b0.\n",
    "\n",
    "We then use numerical differentiation to approximate the gradient. Then, we update the parameters using small step sizes, such that $w0[i+1] = w0[i] - step*\\frac{\\delta Loss()}{\\delta w0}$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 1, Loss: 36.0\n",
      "Iteration 2, Loss: 33.872397424621624\n",
      "Iteration 3, Loss: 31.87054345809546\n",
      "Iteration 4, Loss: 29.98699091998773\n",
      "Iteration 5, Loss: 28.214761511794592\n",
      "Iteration 6, Loss: 26.54726775906168\n",
      "Iteration 7, Loss: 24.978326552541866\n",
      "Iteration 8, Loss: 23.5021050739742\n",
      "Iteration 9, Loss: 22.11313179151597\n",
      "Iteration 10, Loss: 20.806246424284897\n",
      "Iteration 11, Loss: 19.576596334671486\n",
      "Iteration 12, Loss: 18.41961908608719\n",
      "Iteration 13, Loss: 17.33101994032309\n",
      "Iteration 14, Loss: 16.306757070164853\n",
      "Iteration 15, Loss: 15.343027506224132\n",
      "Iteration 16, Loss: 14.436253786815284\n",
      "Iteration 17, Loss: 13.583071280700132\n",
      "Iteration 18, Loss: 12.780312744165439\n",
      "Iteration 19, Loss: 12.024995767388878\n",
      "Iteration 20, Loss: 11.314319082257104\n",
      "Iteration 21, Loss: 10.64564263994962\n",
      "Iteration 22, Loss: 10.016485041642266\n",
      "Iteration 23, Loss: 9.424510031713222\n",
      "Iteration 24, Loss: 8.867521365009814\n",
      "Iteration 25, Loss: 8.34345204094211\n",
      "Iteration 26, Loss: 7.850353118483743\n",
      "Iteration 27, Loss: 7.386397874602818\n",
      "Iteration 28, Loss: 6.94986173712617\n",
      "Iteration 29, Loss: 6.539124434950737\n",
      "Iteration 30, Loss: 6.1526621719118015\n",
      "Iteration 31, Loss: 5.789039869058961\n",
      "Iteration 32, Loss: 5.446907999417336\n",
      "Iteration 33, Loss: 5.124995576577539\n",
      "Iteration 34, Loss: 4.822108497170647\n",
      "Iteration 35, Loss: 4.537121521071987\n",
      "Iteration 36, Loss: 4.268978030723312\n",
      "Iteration 37, Loss: 4.01668121563854\n",
      "Iteration 38, Loss: 3.7792956126389763\n",
      "Iteration 39, Loss: 3.5559389510643094\n",
      "Iteration 40, Loss: 3.345782865003274\n",
      "Iteration 41, Loss: 3.1480471758404285\n",
      "Iteration 42, Loss: 2.961997679823884\n",
      "Iteration 43, Loss: 2.78694359065541\n",
      "Iteration 44, Loss: 2.622235303237792\n",
      "Iteration 45, Loss: 2.467261121418954\n",
      "Iteration 46, Loss: 2.321446092335641\n",
      "Iteration 47, Loss: 2.184248486806066\n",
      "Iteration 48, Loss: 2.0551593804914616\n",
      "Iteration 49, Loss: 1.9336995852420789\n",
      "Iteration 50, Loss: 1.8194178573235094\n",
      "Iteration 51, Loss: 1.7118903069357754\n",
      "Iteration 52, Loss: 1.6107175940030252\n",
      "Iteration 53, Loss: 1.5155241897377694\n",
      "Iteration 54, Loss: 1.4259567411109748\n",
      "Iteration 55, Loss: 1.3416826255281136\n",
      "Iteration 56, Loss: 1.262389208248047\n",
      "Iteration 57, Loss: 1.1877819791340551\n",
      "Iteration 58, Loss: 1.1175840765571434\n",
      "Iteration 59, Loss: 1.0515348500680068\n",
      "Iteration 60, Loss: 0.9893891461492582\n",
      "Iteration 61, Loss: 0.930916260625565\n",
      "Iteration 62, Loss: 0.875899078709395\n",
      "Iteration 63, Loss: 0.8241334819517507\n",
      "Iteration 64, Loss: 0.7754271861095672\n",
      "Iteration 65, Loss: 0.7295994320679934\n",
      "Iteration 66, Loss: 0.6864801042040583\n",
      "Iteration 67, Loss: 0.6459091389617334\n",
      "Iteration 68, Loss: 0.6077358933180028\n",
      "Iteration 69, Loss: 0.5718187120029812\n",
      "Iteration 70, Loss: 0.5380242202642829\n",
      "Iteration 71, Loss: 0.5062269967452033\n",
      "Iteration 72, Loss: 0.4763089781884024\n",
      "Iteration 73, Loss: 0.4481591180173807\n",
      "Iteration 74, Loss: 0.42167291418136477\n",
      "Iteration 75, Loss: 0.3967520449790852\n",
      "Iteration 76, Loss: 0.3733039992368791\n",
      "Iteration 77, Loss: 0.3512417316144445\n",
      "Iteration 78, Loss: 0.33048334753976116\n",
      "Iteration 79, Loss: 0.31095177724411444\n",
      "Iteration 80, Loss: 0.2925745286179104\n",
      "Iteration 81, Loss: 0.2752833763568879\n",
      "Iteration 82, Loss: 0.25901412505149535\n",
      "Iteration 83, Loss: 0.2437063914735247\n",
      "Iteration 84, Loss: 0.22930333977371198\n",
      "Iteration 85, Loss: 0.21575151284725816\n",
      "Iteration 86, Loss: 0.2030006012946216\n",
      "Iteration 87, Loss: 0.19100326852350488\n",
      "Iteration 88, Loss: 0.17971497196649536\n",
      "Iteration 89, Loss: 0.1690938194815031\n",
      "Iteration 90, Loss: 0.1591003719214838\n",
      "Iteration 91, Loss: 0.14969754273736763\n",
      "Iteration 92, Loss: 0.14085041966208015\n",
      "Iteration 93, Loss: 0.13252615564761738\n",
      "Iteration 94, Loss: 0.1246938532452423\n",
      "Iteration 95, Loss: 0.11732446503349986\n",
      "Iteration 96, Loss: 0.11039058885430607\n",
      "Iteration 97, Loss: 0.10386649785129919\n",
      "Iteration 98, Loss: 0.09772798570124883\n",
      "Iteration 99, Loss: 0.09195226348280558\n",
      "Iteration 100, Loss: 0.0865178816583512\n",
      "Iteration 101, Loss: 0.08140467291758889\n",
      "Iteration 102, Loss: 0.07659366262828358\n",
      "Iteration 103, Loss: 0.07206697005843195\n",
      "Iteration 104, Loss: 0.06780781192053903\n",
      "Iteration 105, Loss: 0.06380037696069592\n",
      "Iteration 106, Loss: 0.06002977345222309\n",
      "Iteration 107, Loss: 0.0564820075507719\n",
      "Iteration 108, Loss: 0.05314393144118542\n",
      "Iteration 109, Loss: 0.050003114234231524\n",
      "Iteration 110, Loss: 0.04704793686603195\n",
      "Iteration 111, Loss: 0.04426740148833972\n",
      "Iteration 112, Loss: 0.04165120020443161\n",
      "Iteration 113, Loss: 0.03918961375201954\n",
      "Iteration 114, Loss: 0.0368735034129829\n",
      "Iteration 115, Loss: 0.034694277992582755\n",
      "Iteration 116, Loss: 0.032643851730490094\n",
      "Iteration 117, Loss: 0.03071459534999028\n",
      "Iteration 118, Loss: 0.028899363239415818\n",
      "Iteration 119, Loss: 0.027191414181739672\n",
      "Iteration 120, Loss: 0.02558439994540113\n",
      "Iteration 121, Loss: 0.024072362337913877\n",
      "Iteration 122, Loss: 0.022649683089386127\n",
      "Iteration 123, Loss: 0.021311092099735786\n",
      "Iteration 124, Loss: 0.02005160424149179\n",
      "Iteration 125, Loss: 0.01886655505507656\n",
      "Iteration 126, Loss: 0.017751540667355833\n",
      "Iteration 127, Loss: 0.016702427744061103\n",
      "Iteration 128, Loss: 0.01571531497821091\n",
      "Iteration 129, Loss: 0.014786535770396103\n",
      "Iteration 130, Loss: 0.013912651762769943\n",
      "Iteration 131, Loss: 0.013090418519936803\n",
      "Iteration 132, Loss: 0.012316768931710837\n",
      "Iteration 133, Loss: 0.011588849600126475\n",
      "Iteration 134, Loss: 0.010903943586632107\n",
      "Iteration 135, Loss: 0.010259526183227799\n",
      "Iteration 136, Loss: 0.009653186757193668\n",
      "Iteration 137, Loss: 0.009082688171817357\n",
      "Iteration 138, Loss: 0.008545899068542421\n",
      "Iteration 139, Loss: 0.00804083320361364\n",
      "Iteration 140, Loss: 0.007565618804557518\n",
      "Iteration 141, Loss: 0.007118492429622391\n",
      "Iteration 142, Loss: 0.006697793120481266\n",
      "Iteration 143, Loss: 0.0063019473730584336\n",
      "Iteration 144, Loss: 0.005929501997799936\n",
      "Iteration 145, Loss: 0.005579070290327091\n",
      "Iteration 146, Loss: 0.005249347396309216\n",
      "Iteration 147, Loss: 0.004939114136252681\n",
      "Iteration 148, Loss: 0.004647215154254898\n",
      "Iteration 149, Loss: 0.00437256400626425\n",
      "Iteration 150, Loss: 0.004114139259196158\n",
      "Iteration 151, Loss: 0.0038709956233987848\n",
      "Iteration 152, Loss: 0.0036422222163822442\n",
      "Iteration 153, Loss: 0.0034269635873455254\n",
      "Iteration 154, Loss: 0.0032244300300798123\n",
      "Iteration 155, Loss: 0.003033866206344064\n",
      "Iteration 156, Loss: 0.0028545694817259646\n",
      "Iteration 157, Loss: 0.0026858615040063873\n",
      "Iteration 158, Loss: 0.002527124440860861\n",
      "Iteration 159, Loss: 0.002377772426750458\n",
      "Iteration 160, Loss: 0.0022372501846465924\n",
      "Iteration 161, Loss: 0.002105026221950533\n",
      "Iteration 162, Loss: 0.0019806188966821317\n",
      "Iteration 163, Loss: 0.001863566163059441\n",
      "Iteration 164, Loss: 0.0017534302886055876\n",
      "Iteration 165, Loss: 0.0016498016244949178\n",
      "Iteration 166, Loss: 0.0015522968336895225\n",
      "Iteration 167, Loss: 0.0014605572212372654\n",
      "Iteration 168, Loss: 0.0013742383231737623\n",
      "Iteration 169, Loss: 0.0012930183418168389\n",
      "Iteration 170, Loss: 0.0012166008279945002\n",
      "Iteration 171, Loss: 0.0011447005613673634\n",
      "Iteration 172, Loss: 0.0010770513341135804\n",
      "Iteration 173, Loss: 0.001013397095948145\n",
      "Iteration 174, Loss: 0.0009535029620325111\n",
      "Iteration 175, Loss: 0.0008971534673183893\n",
      "Iteration 176, Loss: 0.0008441301639000644\n",
      "Iteration 177, Loss: 0.0007942435095401501\n",
      "Iteration 178, Loss: 0.0007473036766382048\n",
      "Iteration 179, Loss: 0.0007031374518087182\n",
      "Iteration 180, Loss: 0.0006615806720993984\n",
      "Iteration 181, Loss: 0.0006224808039162045\n",
      "Iteration 182, Loss: 0.0005856932236775429\n",
      "Iteration 183, Loss: 0.0005510780772974099\n",
      "Iteration 184, Loss: 0.0005185112321657664\n",
      "Iteration 185, Loss: 0.00048786689510026934\n",
      "Iteration 186, Loss: 0.00045903387854597503\n",
      "Iteration 187, Loss: 0.00043190420223823955\n",
      "Iteration 188, Loss: 0.000406378034681195\n",
      "Iteration 189, Loss: 0.00038236074013664776\n",
      "Iteration 190, Loss: 0.0003597649139507893\n",
      "Iteration 191, Loss: 0.0003385032407062897\n",
      "Iteration 192, Loss: 0.00031849748027454767\n",
      "Iteration 193, Loss: 0.00029967346881992795\n",
      "Iteration 194, Loss: 0.0002819629431575354\n",
      "Iteration 195, Loss: 0.0002652991815966534\n",
      "Iteration 196, Loss: 0.00024961903501571355\n",
      "Iteration 197, Loss: 0.00023486641976601822\n",
      "Iteration 198, Loss: 0.00022098629075865584\n",
      "Iteration 199, Loss: 0.00020792651372860275\n",
      "Iteration 200, Loss: 0.00019563773612380077\n",
      "Final weights: [-3.3990955  -0.20180899  0.80271349]\n",
      "Final bias: 0.6009044964517248\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Initial parameters\n",
    "weights = np.array([-3.0, -1.0, 2.0])\n",
    "bias = 1.0\n",
    "inputs = np.array([1.0, -2.0, 3.0])\n",
    "target_output = 0.0\n",
    "learning_rate = 0.001\n",
    "\n",
    "def relu(x):\n",
    "    return np.maximum(0, x)\n",
    "\n",
    "def relu_derivative(x):\n",
    "    return np.where(x > 0, 1.0, 0.0)\n",
    "\n",
    "for iteration in range(200):\n",
    "    # Forward pass\n",
    "    linear_output = np.dot(weights, inputs) + bias\n",
    "    output = relu(linear_output)\n",
    "    loss = (output - target_output) ** 2\n",
    "\n",
    "    # Backward pass to calculate gradient\n",
    "    dloss_doutput = 2 * (output - target_output)\n",
    "    doutput_dlinear = relu_derivative(linear_output)\n",
    "    dlinear_dweights = inputs\n",
    "    dlinear_dbias = 1.0\n",
    "\n",
    "    dloss_dlinear = dloss_doutput * doutput_dlinear\n",
    "    dloss_dweights = dloss_dlinear * dlinear_dweights\n",
    "    dloss_dbias = dloss_dlinear * dlinear_dbias\n",
    "\n",
    "    # Update weights and bias\n",
    "    weights -= learning_rate * dloss_dweights\n",
    "    bias -= learning_rate * dloss_dbias\n",
    "\n",
    "    # Print the loss for this iteration\n",
    "    print(f\"Iteration {iteration + 1}, Loss: {loss}\")\n",
    "\n",
    "print(\"Final weights:\", weights)\n",
    "print(\"Final bias:\", bias)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Backpropagation of a Layer\n",
    "Same thing as a single neuron, but now using matrices to keep track of each neuron in the layer.\n",
    "\n",
    "If there are multiple input arrays (batches), one can take the summation of the loss from each batch as a total loss, and therefore the gradient of the total loss with respect to a weight or bias is the summation of the gradients of each batch's loss with respect to the weight or bias given that batch's input.\n",
    "\n",
    "In general, the partial derivative of the loss with respect to a specific weight or bias remains the same across all neurons of that layer for that batch. ie, the weight gradient matrix has the same column vector for N number of neurons. The bias gradient matrix is similar but is a single row of N elements for the same value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 0, Loss: 466.56000000000006\n",
      "Iteration 20, Loss: 5.329595763793193\n",
      "Iteration 40, Loss: 0.41191524253483786\n",
      "Iteration 60, Loss: 0.03183621475376345\n",
      "Iteration 80, Loss: 0.002460565405431671\n",
      "Iteration 100, Loss: 0.0001901729121621426\n",
      "Iteration 120, Loss: 1.4698120139337557e-05\n",
      "Iteration 140, Loss: 1.1359948840900371e-06\n",
      "Iteration 160, Loss: 8.779778427447647e-08\n",
      "Iteration 180, Loss: 6.785903626216421e-09\n",
      "Final weights:\n",
      " [[-0.00698895 -0.0139779  -0.02096685 -0.0279558 ]\n",
      " [ 0.25975286  0.11950571 -0.02074143 -0.16098857]\n",
      " [ 0.53548461  0.27096922  0.00645383 -0.25806156]]\n",
      "Final biases:\n",
      " [-0.00698895 -0.04024714 -0.06451539]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Initial inputs\n",
    "inputs = np.array([1, 2, 3, 4])\n",
    "\n",
    "# Initial weights and biases\n",
    "weights = np.array([\n",
    "    [0.1, 0.2, 0.3, 0.4],\n",
    "    [0.5, 0.6, 0.7, 0.8],\n",
    "    [0.9, 1.0, 1.1, 1.2]\n",
    "])\n",
    "\n",
    "biases = np.array([0.1, 0.2, 0.3])\n",
    "\n",
    "learning_rate = 0.001\n",
    "\n",
    "# Add the derivative function to the ReLU class\n",
    "class Activation_ReLU:\n",
    "    def forward(self, inputs):\n",
    "        return np.maximum(0, inputs)\n",
    "    \n",
    "    def derivative(self, inputs):\n",
    "        return np.where(inputs > 0, 1, 0)\n",
    "    \n",
    "relu = Activation_ReLU()\n",
    "\n",
    "num_iterations = 200\n",
    "\n",
    "# Training loop\n",
    "# A single layer of 3 neurons, each with 4 inputs\n",
    "# The neuron layer is then fed into a ReLU activation layer\n",
    "for iteration in range(num_iterations):\n",
    "    # Forward pass\n",
    "    neuron_outputs = np.dot(weights, inputs) + biases\n",
    "    relu_outputs = relu.forward(neuron_outputs)\n",
    "    \n",
    "    # Calculate the squared loss assuming the desired output is a sum of 0. Trivial but just an example\n",
    "    final_output = np.sum(relu_outputs)\n",
    "    loss = final_output**2\n",
    "\n",
    "    # Backward pass\n",
    "    dL_dfinal_output = 2 * final_output\n",
    "    dfinal_output_drelu_output = np.ones_like(relu_outputs)\n",
    "    drelu_output_dneuron_output = relu.derivative(neuron_outputs)\n",
    "\n",
    "    dL_dneuron_output = dL_dfinal_output * dfinal_output_drelu_output * drelu_output_dneuron_output\n",
    "\n",
    "    # Get the gradient of the Loss with respect to the weights and biases\n",
    "    # dL_dW = np.outer(dL_dneuron_output, inputs)\n",
    "    dL_dW = inputs.reshape(-1, 1) @ dL_dneuron_output.reshape(1, -1)\n",
    "    dL_db = dL_dneuron_output\n",
    "\n",
    "    # Update the weights and biases\n",
    "    # Remove the .T if using dL_dW = np.outer(dL_dneuron_output, inputs)\n",
    "    weights -= learning_rate * dL_dW.T\n",
    "    biases -= learning_rate * dL_db\n",
    "\n",
    "    # Print the loss every 20 iterations\n",
    "    if iteration % 20 == 0:\n",
    "        print(f\"Iteration {iteration}, Loss: {loss}\")\n",
    "\n",
    "# Final weights and biases\n",
    "print(\"Final weights:\\n\", weights)\n",
    "print(\"Final biases:\\n\", biases)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Change of Notation\n",
    "The previous notation is clunky and long. From here forward, we will use the following notation for a layer with $n$ inputs and $i$ neurons. The neruon layer has is followed by an activation layer and then fed into a final value $y$ with a computed loss $l$. There can be $j$ batches of data.\n",
    "\n",
    "$\\vec{X_j} = \\begin{bmatrix} x_{1j} & x_{2j} & \\cdots & x_{nj} \\end{bmatrix}$ -> Row vector for the layer inputs for the $j$ batch of data.\n",
    "\n",
    "$\\overline{\\overline{W}} = \\begin{bmatrix} \\vec{w_{1}} \\\\ \\vec{w_{2}} \\\\ \\vdots \\\\ \\vec{w_{i}} \\end{bmatrix} = \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in}\\end{bmatrix}$ -> Matrix of weight values. Each row is a neuron's weights and each column is the weights for a given input.\n",
    "\n",
    "$\\vec{B} = \\begin{bmatrix} b_1 & b_2 & \\cdots & b_i \\end{bmatrix}$ -> Row vector for the neuron biases\n",
    "\n",
    "$\\vec{Z_j} = \\begin{bmatrix} z_{1j} & z_{2j} & \\cdots & z_{ij} \\end{bmatrix}$ -> Row vector for the neuron outputs for the $j$ batch of data.\n",
    "\n",
    "$\\vec{A_j} = \\begin{bmatrix} a_{1j} & a_{2j} & \\cdots & a_{ij} \\end{bmatrix}$ -> Row vector for the activation later outputs for the $j$ batch of data.\n",
    "\n",
    "$y_j$ -> Final layer output for the $j$ batch of data if the layer is the final layer (could be summation, probability, etc).\n",
    "\n",
    "$l_j$ -> Loss for the $j$ batch of data.\n",
    "\n",
    "The $j$ is often dropped because we typically only need to think with 1 set of input data.\n",
    "\n",
    "### Gradient Descent Using New Notation\n",
    "We will look at the weight that the $i$ neuron applies for the $n$ input.\n",
    "\n",
    "$\\frac{\\delta l}{\\delta w_{in}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta w_{in}}$\n",
    "\n",
    "Similarly, for the bias of the $i$ neuron, there is\n",
    "\n",
    "$\\frac{\\delta l}{\\delta b_{i}} = \\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} \\frac{\\delta z_i}{\\delta b_{i}}$\n",
    "\n",
    "For the system we are using, where $l = (y-0)^2$ and the activation layer is ReLU, we have\n",
    "\n",
    "$\\frac{\\delta l}{\\delta y} = 2y$\n",
    "\n",
    "$\\frac{\\delta y}{\\delta a_i} = 1$\n",
    "\n",
    "$\\frac{\\delta a_i}{\\delta z_i} = 1$ if $z_i > 0$ else $0$\n",
    "\n",
    "$\\frac{\\delta z_i}{\\delta w_{in}} = x_n$\n",
    "\n",
    "$\\frac{\\delta z_i}{\\delta b_{i}} = 1$\n",
    "\n",
    "### Matrix Representation of Gradient Descent\n",
    "We can simplify by seeing that $\\frac{\\delta l}{\\delta y} \\frac{\\delta y}{\\delta a_i} \\frac{\\delta a_i}{\\delta z_i} = \\frac{\\delta l}{\\delta z_i}$ is a common term.\n",
    "\n",
    "We take $\\frac{\\delta l}{\\delta z_i}$ and turn it into a 1 x $i$ vector that such that \n",
    "\n",
    "$\\frac{\\delta l}{\\delta \\vec{Z}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n",
    "\n",
    "We than can get that the gradient matrix for all weights is a $i$ x $n$ matrix given by \n",
    "\n",
    "$\\frac{\\delta l}{\\delta \\overline{\\overline{W}}} =  \\begin{bmatrix}  \\frac{\\delta l}{\\delta w_{11}} & \\frac{\\delta l}{\\delta w_{12}} & \\cdots & \\frac{\\delta l}{\\delta w_{1n}} \\\\ \\frac{\\delta l}{\\delta w_{21}} & w\\frac{\\delta l}{\\delta w_{22}} & \\cdots & \\frac{\\delta l}{\\delta w_{2n}} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\frac{\\delta l}{\\delta w_{i1}} & \\frac{\\delta l}{\\delta w_{i2}} & \\cdots & \\frac{\\delta l}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_n} \\end{bmatrix} \\begin{bmatrix} \\frac{\\delta z_1}{\\delta w_{i1}} & \\frac{\\delta z_1}{\\delta w_{i1}} & \\cdots & \\frac{\\delta z_1}{\\delta w_{in}} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} \\\\ \\frac{\\delta l}{\\delta z_2} \\\\ \\vdots \\\\ \\frac{\\delta l}{\\delta z_n} \\end{bmatrix} \\begin{bmatrix} x_1 & x_2 & \\cdots & x_n \\end{bmatrix}$\n",
    "\n",
    "Similarly, the gradient vector for the biases is given by\n",
    "$\\frac{\\delta l}{\\delta \\vec{B}} = \\frac{\\delta l}{\\delta \\vec{Z}} \\frac{\\delta \\vec{Z}}{\\delta \\vec{B}} = \\vec{1} \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_i} \\end{bmatrix}$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 0, Loss: 466.56000000000006\n",
      "Iteration 20, Loss: 5.32959636083938\n",
      "Iteration 40, Loss: 0.41191523404899866\n",
      "Iteration 60, Loss: 0.031836212079467595\n",
      "Iteration 80, Loss: 0.002460565465389601\n",
      "Iteration 100, Loss: 0.000190172825660145\n",
      "Iteration 120, Loss: 1.4698126966451542e-05\n",
      "Iteration 140, Loss: 1.1359926717815175e-06\n",
      "Iteration 160, Loss: 8.779889800154524e-08\n",
      "Iteration 180, Loss: 6.7858241357822796e-09\n",
      "Final weights:\n",
      " [[-0.00698895 -0.01397789 -0.02096684 -0.02795579]\n",
      " [ 0.25975286  0.11950572 -0.02074143 -0.16098857]\n",
      " [ 0.53548461  0.27096922  0.00645383 -0.25806156]]\n",
      "Final biases:\n",
      " [-0.00698895 -0.04024714 -0.06451539]\n"
     ]
    }
   ],
   "source": [
    "# Code changed to match new notation\n",
    "import numpy as np\n",
    "\n",
    "# Initial inputs\n",
    "X = np.array([1, 2, 3, 4])\n",
    "\n",
    "# Initial weights and biases\n",
    "W = np.array([\n",
    "    [0.1, 0.2, 0.3, 0.4],\n",
    "    [0.5, 0.6, 0.7, 0.8],\n",
    "    [0.9, 1.0, 1.1, 1.2]\n",
    "])\n",
    "\n",
    "B = np.array([0.1, 0.2, 0.3])\n",
    "\n",
    "learning_rate = 0.001\n",
    "\n",
    "# Add the derivative function to the ReLU class\n",
    "class Activation_ReLU:\n",
    "    def forward(self, inputs):\n",
    "        return np.maximum(0, inputs)\n",
    "    \n",
    "    def derivative(self, inputs):\n",
    "        return np.where(inputs > 0, 1, 0)\n",
    "    \n",
    "relu = Activation_ReLU()\n",
    "\n",
    "num_iterations = 200\n",
    "\n",
    "# Training loop\n",
    "# A single layer of 3 neurons, each with 4 inputs\n",
    "# The neuron layer is then fed into a ReLU activation layer\n",
    "for iteration in range(num_iterations):\n",
    "    # Forward pass\n",
    "    Z = np.dot(W, X) + B\n",
    "    A = relu.forward(Z)\n",
    "    \n",
    "    # Calculate the squared loss assuming the desired output is a sum of 0. Trivial but just an example\n",
    "    y = np.sum(A)\n",
    "    l = y**2\n",
    "\n",
    "    # Backward pass\n",
    "    dL_dy = 2 * y\n",
    "    dy_dA = np.ones_like(A)\n",
    "    dA_dZ = relu.derivative(Z)\n",
    "\n",
    "    dl_dZ = dL_dy * dy_dA * dA_dZ\n",
    "\n",
    "    # Get the gradient of the Loss with respect to the weights and biases\n",
    "    dL_dW = np.outer(X.T, dl_dZ)\n",
    "    dL_dB = dl_dZ\n",
    "\n",
    "    # Update the weights and biases\n",
    "    W -= learning_rate * dL_dW.T\n",
    "    B -= learning_rate * dL_dB\n",
    "\n",
    "    # Print the loss every 20 iterations\n",
    "    if iteration % 20 == 0:\n",
    "        print(f\"Iteration {iteration}, Loss: {l}\")\n",
    "\n",
    "# Final weights and biases\n",
    "print(\"Final weights:\\n\", W)\n",
    "print(\"Final biases:\\n\", B)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gradients of the Loss with Respect to Inputs\n",
    "When chaining multiple layers together, we will need the partial derivatives of the loss with respect to the next layers input (ie, the output of the current layer). This involves extra summation because the output of 1 layer is fed into every neuron of the next layer, so the total loss must be found.\n",
    "\n",
    "The gradient of the loss with respect to the $n$ input fed into $i$ neurons is\n",
    "\n",
    "$\\frac{\\delta l}{\\delta x_n} = \\frac{\\delta l}{\\delta z_1} \\frac{\\delta z_1}{\\delta x_n} + \\frac{\\delta l}{\\delta z_2} \\frac{\\delta z_2}{\\delta x_n} + ... + \\frac{\\delta l}{\\delta z_i} \\frac{\\delta z_i}{\\delta x_n}$\n",
    "\n",
    "\n",
    "Noting that $\\frac{\\delta z_i}{\\delta x_n} = w_{in}$ allows us to have\n",
    "\n",
    "$\\frac{\\delta l}{\\delta \\vec{X}} = \\begin{bmatrix} \\frac{\\delta l}{\\delta x_1} & \\frac{\\delta l}{\\delta x_2} & \\cdots & \\frac{\\delta l}{\\delta x_n} \\end{bmatrix} = \\begin{bmatrix} \\frac{\\delta l}{\\delta z_1} & \\frac{\\delta l}{\\delta z_2} & \\cdots & \\frac{\\delta l}{\\delta z_n} \\end{bmatrix} \\begin{bmatrix} w_{11} & w_{12} & \\cdots & w_{1n} \\\\ w_{21} & w_{22} & \\cdots & w_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ w_{i1} & w_{i2} & \\cdots & w_{in} \\end{bmatrix}$\n",
    "\n",
    "## Note With Layer_Dense class\n",
    "The Layer_Dense class has the weights stored in the transposed fashion for forward propagation. Therefore, the weight matrix must be transposed for the backpropagation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}