Lecture 29. K-Fold Cross Validation
@@ -435,6 +435,35 @@
"\n",
"These \"hyper-parameters\" can be adjusted after testing with out-of-sample data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Types of Data\n",
"Training data is used to optimize a given network's weights and minimize its loss.\n",
"\n",
"Validation data is used to optimize the hyper-parameters of the network: the number of layers, neurons per layer, activation functions and their constants, epochs, learning rates, etc.\n",
"\n",
"Testing data is used to measure the out-of-sample performance of the trained network.\n",
"\n",
"## Splitting up Data\n",
"### Given a Lot of Data\n",
"The dataset is broken primarily into training data, plus separate validation and testing data.\n",
"\n",
"### Given Limited Data\n",
"The dataset is only broken up into training and testing data (perhaps 80%/20%). K-Fold cross validation can then be used on the training data in these limited-data scenarios.\n",
"\n",
"# K-Fold Cross Validation\n",
"With a limited training dataset, you can split it further into subsections (folds), say 5. You then have 5 different combinations of data, where 1 fold serves as the validation set while the other 4 serve as the training data.\n",
"\n",
"When using 5 folds of the training data, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken to be the average of all 5.\n",
"\n",
"To compare different hyper-parameter settings, you run the network on the same folds for each setting and choose the setting with the lowest total (average) validation loss.\n",
"\n",
"## Data Leakage\n",
"While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, a fold may get access to future information and train on it."
]
}
],
"metadata": {
@@ -372,4 +372,29 @@ print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
#
# These "hyper-parameters" can be adjusted after testing with out-of-sample data.

# %% [markdown]
# # Types of Data
# Training data is used to optimize a given network's weights and minimize its loss.
#
# Validation data is used to optimize the hyper-parameters of the network: the number of layers, neurons per layer, activation functions and their constants, epochs, learning rates, etc.
#
# Testing data is used to measure the out-of-sample performance of the trained network.
#
# ## Splitting up Data
# ### Given a Lot of Data
# The dataset is broken primarily into training data, plus separate validation and testing data.
#
# ### Given Limited Data
# The dataset is only broken up into training and testing data (perhaps 80%/20%). K-Fold cross validation can then be used on the training data in these limited-data scenarios, as sketched below.
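
# %% [markdown]
# A minimal sketch of the 80%/20% split, assuming `X` and `y` are NumPy arrays of samples and labels; the function name and signature are illustrative, not from the lecture code.

# %%
import numpy as np

def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the samples, then hold out the last test_fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # random order of sample indices
    split = int(len(X) * (1 - test_fraction))  # e.g. 80% of the data for training
    train_idx, test_idx = indices[:split], indices[split:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]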

# %% [markdown]
# # K-Fold Cross Validation
# With a limited training dataset, you can split it further into subsections (folds), say 5. You then have 5 different combinations of data, where 1 fold serves as the validation set while the other 4 serve as the training data.
#
# When using 5 folds of the training data, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken to be the average of all 5.
#
# To compare different hyper-parameter settings, you run the network on the same folds for each setting and choose the setting with the lowest total (average) validation loss, as in the sketch below.
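
# %% [markdown]
# A minimal sketch of 5-fold cross validation over the training data. It assumes a hypothetical
# train_and_evaluate(X_train, y_train, X_val, y_val, **hyper_params) helper that builds and trains a
# fresh network with the given hyper-parameters and returns its validation loss; that helper is not
# part of the lecture code.

# %%
import numpy as np

def k_fold_validation_loss(X, y, train_and_evaluate, k=5, **hyper_params):
    """Average validation loss over k folds for one hyper-parameter setting."""
    folds = np.array_split(np.arange(len(X)), k)  # k roughly equal groups of indices
    losses = []
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # train_and_evaluate is a hypothetical helper: train a fresh network, return validation loss
        loss = train_and_evaluate(X[train_idx], y[train_idx],
                                  X[val_idx], y[val_idx], **hyper_params)
        losses.append(loss)
    return np.mean(losses)  # total validation loss = average over the k folds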

# %% [markdown]
# ## Data Leakage
# While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, a fold may get access to future information and train on it.
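
# %% [markdown]
# One way to avoid that leakage, sketched here under the assumption that the samples are already in
# chronological order, is forward chaining: each fold is validated on a block of data and trained only
# on the data that comes before it in time. This is an illustrative sketch, not part of the lecture code.

# %%
import numpy as np

def forward_chaining_folds(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs where the training data always precedes the validation data."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(1, k):
        train_idx = np.concatenate(folds[:i])  # everything strictly before fold i
        val_idx = folds[i]                     # fold i lies in the "future" of its training data
        yield train_idx, val_idx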