{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Principles into Practice\n",
"\n",
"Let's put the principles from last chapter into code. Here is the pseudocode:\n",
"\n",
"1. All of your import functions\n",
"2. Load data \n",
"3. Split your data into 2 subsamples: a \"test\" and \"train\" portion as [in this page](03_ML) or [this page](03c_ModelEval) or [this page](03c1_OOS). This is the first arrow in the picture below.[^pic] We will do all of our work on the \"train\" sample. \n",
"4. Before modelling, do EDA (**on the training data only!**)\n",
" - Sample basics: What is the unit of observation? What time spans are covered?\n",
" - Look for outliers, missing values, or data errors\n",
" - Note what variables are continuous or discrete numbers, which variables are categorical variables (and whether the categorical ordering is meaningful) \n",
" - You should read up on what all the variables mean from the documentation in the data folder.\n",
" - Visually explore the relationship between `v_Sale_Price` and other variables.\n",
" - For continuous variables - take note of whether the relationship seems linear or quadratic or polynomial\n",
" - For categorical variables - maybe try a box plot for the various levels?\n",
" - Now decide how you'd clean the data (imputing missing values, scaling variables, encoding categorical variables). These lessons will go into the preprocessing portion of your pipeline below. The [sklearn guide on preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) is very informative, as [this page **and the video I link to therein.**](04e1_preprocessing)\n",
" \n",
"5. Prepare to optimize a series of models ([covered here](04e_pipelines)) \n",
" 1. Set up one pipeline to clean each type of variable\n",
" 2. Combine those pipes into a \"preprocessing\" pipeline using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)\n",
" 1. [Set up your cross validation method](04d_crossval):\n",
" - The picture below illustrates 5 folds apparently based on the row number.\n",
" - There are many [CV splitters](https://scikit-learn.org/stable/modules/classes.html#splitter-classes) available, including TimeSeriesSplit (a starting point for asset price predictions) and [GroupTimesSeriesSplit](https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243) is in development (which addresses a core problem with TimeSeriesSplit, and [is shown in practice here](https://www.kaggle.com/jorijnsmit/found-the-holy-grail-grouptimeseriessplit))\n",
" 1. Set up your scoring [metric](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) as discussed [here](03d_whatToMax)\n",
"5. [Optimize candidate model 1](04f_optimizing_a_model) _on the training data_\n",
" 1. Set up [a pipeline](04e_pipelines) that combines preprocessing, estimator\n",
" 1. Set up a hyper param grid\n",
" 1. Find optimal hyper params (e.g. gridsearchcv)\n",
" 1. Save pipeline with optimal params in place\n",
"6. Repeat step 6 for other candidate models\n",
"7. Compare all of the optimized models\n",
" ```python\n",
" # something like...\n",
" for model in models:\n",
" cross_validate(model, X, y,...)\n",
" ```\n",
"\n",
"---\n",
"\n",
"Here is that outline, but as a block of code you can use as a blueprint in projects:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import lots of functions\n",
"# load data \n",
"# split to test and train (link to split page/sk docs)\n",
"\n",
"## pre-modeling (on the training data only!)\n",
"\n",
"# do lots of EDA\n",
"# look for missing values, which variables are what type, and outliers \n",
"# figure out how you'd clean the data (imputation, scaling, encoding categorical vars)\n",
"# these lessons will go into the preprocessign portion of your pipeline \n",
"\n",
"## optimize a series of models \n",
"\n",
"# set up pipeline to clean each type of variable (1 pipe per var type)\n",
"# combine those pipes into \"preprocess\" pipe\n",
"# set up cv (can set up iterable to do OOS! or TimeSeriesSplit, or...)\n",
"# set up scoring \n",
"\n",
"## optimize candidate model type #1: \n",
"\n",
"# set up pipeline (combines preprocessing, estimator)\n",
"# set up hyper param grid\n",
"# find optimal hyper params (gridsearchcv)\n",
"# save pipeline with optimal params in place\n",
"# (Note: you should spend time interrogating model predictions, plotting and printing.\n",
"# Does the model struggle predicting certain obs? Excel at some?)\n",
"\n",
"## optimize candidate model type #2\n",
"\n",
"...\n",
"\n",
"## optimize candidate model type #N\n",
"\n",
"## compare the N optimized models\n",
"\n",
"# build list of models (each with own optimized hyperparams)\n",
"# for model in models:\n",
"# cross_validate(model, X, y,...)\n",
"# pick the winner!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"[^pic]: ![](img/feature_5_fold_cv.jpg)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}