Lecture 29. K-Fold Cross Validation

This commit is contained in:
judsonupchurch 2025-01-27 02:33:44 +00:00
parent a95ddb61b8
commit 95127b5eb4
3 changed files with 54 additions and 0 deletions


@ -435,6 +435,35 @@
"\n",
"These \"hyper-parameters\" can be adjusted after testing with out of sample data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Types of Data\n",
"Training data is used to optimize a given network and minimize loss.\n",
"\n",
"Validation data is used to optimize the hyper-parameters of the network: parameters like the number of layers, neurons per layer, activation functions and their constants, epochs, learning rates, etc.\n",
"\n",
"Testing data is used to see the out-of-sample effectiveness of the trained network.\n",
"\n",
"## Splitting up Data\n",
"### Given a Lot of Data\n",
"The dataset is broken primarily into training data, with the remainder split into validation and testing data.\n",
"\n",
"### Given Limited Data\n",
"The dataset is only broken up into training data and testing data (maybe 80%-20%). K-Fold cross validation can be used in these limited-data scenarios.\n",
"\n",
"# K-Fold Cross Validation\n",
"With a limited dataset, you can split the training data further into subsections, say 5. You then have 5 different combinations of the data, where 1 subsection serves as the validation data while the other 4 serve as the training data.\n",
"\n",
"When using 5 subsections of the training data, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken as the average of all 5.\n",
"\n",
"To compare different hyper-parameter settings, you run each candidate network through the same folds and choose the setting with the lowest total validation loss.\n",
"\n",
"## Data Leakage\n",
"While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, naively shuffling samples into folds can give the network access to future information during training."
]
}
],
"metadata": {



@ -372,4 +372,29 @@ print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
#
# These "hyper-parameters" can be adjusted after testing with out of sample data.
# %% [markdown]
# # Types of Data
# Training data is used to optimize a given network and minimize loss.
#
# Validation data is used to optimize the hyper-parameters of the network: parameters like the number of layers, neurons per layer, activation functions and their constants, epochs, learning rates, etc.
#
# Testing data is used to see the out-of-sample effectiveness of the trained network.
#
# ## Splitting up Data
# ### Given a Lot of Data
# The dataset is broken primarily into training data, with the remainder split into validation and testing data.
#
# ### Given Limited Data
# The dataset is only broken up into training data and testing data (maybe 80%-20%). K-Fold cross validation can be used in these limited-data scenarios.
#
# # K-Fold Cross Validation
# With a limited dataset, you can split the training data further into subsections, say 5. You then have 5 different combinations of the data, where 1 subsection serves as the validation data while the other 4 serve as the training data.
#
# When using 5 subsections of the training data, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken as the average of all 5.
#
# To compare different hyper-parameter settings, you run each candidate network through the same folds and choose the setting with the lowest total validation loss.
#
# ## Data Leakage
# While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, naively shuffling samples into folds can give the network access to future information during training.
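# %% [markdown]
# The fold rotation described above can be sketched as follows. This is a minimal sketch, not the lecture's network: `train_and_validate` is a hypothetical stand-in (a toy least-squares fit on synthetic data) for whatever training routine is actually used, and the fold logic is the part that matters.

```python
import numpy as np

def train_and_validate(train_X, train_y, val_X, val_y):
    # Toy stand-in for network training: fit a linear model by least
    # squares and report mean squared error on the held-out fold.
    w, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)
    return float(np.mean((val_X @ w - val_y) ** 2))

# Synthetic "limited" dataset (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
# Split the training data into k subsections: {A, B, C, D, E}.
folds = np.array_split(np.arange(len(X)), k)

losses = []
for i in range(k):
    val_idx = folds[i]  # 1 subsection is the validation data...
    train_idx = np.concatenate(  # ...the other 4 are the training data
        [f for j, f in enumerate(folds) if j != i])
    losses.append(train_and_validate(X[train_idx], y[train_idx],
                                     X[val_idx], y[val_idx]))

# Total validation loss is the average over all k folds.
total_val_loss = sum(losses) / k
print(f'fold losses: {[f"{l:.4f}" for l in losses]}')
print(f'total validation loss: {total_val_loss:.4f}')
```

To compare hyper-parameter settings, the loop above would be rerun per setting on the same folds, keeping the setting with the lowest `total_val_loss`. Note that `np.array_split` here keeps samples in order rather than shuffling, which is the safer default for time-series data.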