diff --git a/lecture28_31/notes_28.ipynb b/lecture28_31/notes_28.ipynb
index 5caaf9d..c07907b 100644
--- a/lecture28_31/notes_28.ipynb
+++ b/lecture28_31/notes_28.ipynb
@@ -435,6 +435,35 @@
     "\n",
     "These \"hyper-parameters\" can be adjusted after testing with out of sample data."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Types of Data\n",
+    "Training data is used to optimize a given network and minimize its loss.\n",
+    "\n",
+    "Validation data is used to tune the network's hyper-parameters: the number of layers, neurons per layer, activation functions and their constants, epochs, learning rate, etc.\n",
+    "\n",
+    "Testing data is used to measure the out-of-sample effectiveness of the trained network.\n",
+    "\n",
+    "## Splitting up Data\n",
+    "### Given a Lot of Data\n",
+    "The dataset is split into a large training set plus separate validation and testing sets.\n",
+    "\n",
+    "### Given Limited Data\n",
+    "The dataset is split only into training and testing data (e.g. 80%/20%). K-Fold cross validation can then stand in for a separate validation set.\n",
+    "\n",
+    "# K-Fold Cross Validation\n",
+    "With a limited dataset, you can split the training data further into subsections (folds), say 5. This gives 5 different combinations of data, where 1 fold serves as the validation set while the other 4 serve as the training set.\n",
+    "\n",
+    "With 5 folds, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken to be the average of all 5.\n",
+    "\n",
+    "To compare hyper-parameter settings, you run each candidate on the same folds and choose the setting with the lowest total validation loss.\n",
+    "\n",
+    "## Data Leakage\n",
+    "While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, a naive shuffled split may give the model access to future information during training."
+   ]
+  }
 ]
 },
 "metadata": {
diff --git a/lecture28_31/notes_28.pdf b/lecture28_31/notes_28.pdf
index 6d072b0..3703e56 100644
Binary files a/lecture28_31/notes_28.pdf and b/lecture28_31/notes_28.pdf differ
diff --git a/lecture28_31/notes_28.py b/lecture28_31/notes_28.py
index dab54d0..d4197a3 100644
--- a/lecture28_31/notes_28.py
+++ b/lecture28_31/notes_28.py
@@ -372,4 +372,81 @@
 print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
 #
 # These "hyper-parameters" can be adjusted after testing with out of sample data.
+# %% [markdown]
+# # Types of Data
+# Training data is used to optimize a given network and minimize its loss.
+#
+# Validation data is used to tune the network's hyper-parameters: the number of layers, neurons per layer, activation functions and their constants, epochs, learning rate, etc.
+#
+# Testing data is used to measure the out-of-sample effectiveness of the trained network.
+#
+# ## Splitting up Data
+# ### Given a Lot of Data
+# The dataset is split into a large training set plus separate validation and testing sets.
+#
+# ### Given Limited Data
+# The dataset is split only into training and testing data (e.g. 80%/20%). K-Fold cross validation can then stand in for a separate validation set.
+#
+# # K-Fold Cross Validation
+# With a limited dataset, you can split the training data further into subsections (folds), say 5. This gives 5 different combinations of data, where 1 fold serves as the validation set while the other 4 serve as the training set.
+#
+# With 5 folds, say {A, B, C, D, E}, you get 5 validation losses. The total validation loss is taken to be the average of all 5.
+#
+# To compare hyper-parameter settings, you run each candidate on the same folds and choose the setting with the lowest total validation loss.
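+#
+# As a quick sketch of these mechanics (assumed here, not the lecture's own
+# code), the cell below averages the 5 fold losses; `train_and_validate` is a
+# placeholder standing in for building and training a real network.
+
+# %%
+import numpy as np
+
+
+def train_and_validate(X_train, y_train, X_val, y_val):
+    # placeholder "model": predict the mean of the training targets and
+    # score mean squared error on the held-out fold
+    prediction = y_train.mean()
+    return np.mean((y_val - prediction) ** 2)
+
+
+def k_fold_validation_loss(X, y, k=5):
+    indices = np.arange(len(X))
+    np.random.shuffle(indices)          # shuffling is fine for non-timeseries data
+    folds = np.array_split(indices, k)  # e.g. {A, B, C, D, E} when k=5
+    losses = []
+    for i in range(k):
+        val_idx = folds[i]                                     # 1 fold validates
+        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # other k-1 train
+        losses.append(train_and_validate(X[train_idx], y[train_idx],
+                                         X[val_idx], y[val_idx]))
+    return np.mean(losses)  # total validation loss = average of the k fold losses
+
+
+X = np.random.randn(100, 3)
+y = np.random.randn(100)
+print(f'validation, avg loss: {k_fold_validation_loss(X, y, k=5):.3f}')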
+
+# %% [markdown]
+# ## Data Leakage
+# While K-Fold is good for tuning hyper-parameters with limited data, it can suffer from data leakage if not set up correctly. For example, with time-series data, a naive shuffled split may give the model access to future information during training.
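+
+# %% [markdown]
+# As a sketch of one common fix (not covered in the lecture itself), a
+# walk-forward split keeps the folds in chronological order: each fold
+# validates on a block that comes strictly after all of its training data.
+
+# %%
+import numpy as np
+
+
+def walk_forward_folds(n_samples, k=5):
+    # cut the time axis into k + 1 equal blocks; fold i trains on everything
+    # before block i + 1 and validates on block i + 1, so no future data leaks
+    boundaries = np.linspace(0, n_samples, k + 2, dtype=int)
+    for i in range(1, k + 1):
+        yield np.arange(0, boundaries[i]), np.arange(boundaries[i], boundaries[i + 1])
+
+
+for train_idx, val_idx in walk_forward_folds(100, k=4):
+    print(f'train on [0, {train_idx[-1]}], validate on [{val_idx[0]}, {val_idx[-1]}]')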