Machine Learning 101 - Regression¶

How to Build and Interpret Regression Models¶

Author: Kris Barbier

Overview:¶

This notebook will demonstrate how to build and interpret 2 types of regression models: Linear Regression and Random Forests.

Regression Models Overview:¶

  • In supervised machine learning, there are two main tasks: classification, which predicts categories, and regression, which predicts continuous numerical values. In this notebook, we will build and interpret regression models using the following steps:
    • Import needed libraries and read in data.
    • Quickly preprocess data for modeling (for an in-depth look at preprocessing, check out the preprocessing notebook).
    • Use model pipelines to efficiently build 2 different types of regression models.
    • Evaluate models using different metrics to test each model's accuracy of predictions.

Regression Models in Code¶

Import Libraries and Read in Data¶

In [1]:
#Common imports for data science
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #For visualizations
import seaborn as sns #For visualizations

#Imports for machine learning 
from sklearn.model_selection import train_test_split  #For validation split

#Imports for feature transformations
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#Imports for building preprocessing object
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

#Imports for regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

#Imports for model metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Set sklearn output to pandas
from sklearn import set_config
set_config(transform_output = 'pandas')

#Mute warnings
import warnings
warnings.filterwarnings('ignore')
In [2]:
#Read in sample dataset from repo folder
file_path = "Data/insurance_mod.csv"
df = pd.read_csv(file_path)
#Preview data
df.head()
Out[2]:
age sex bmi children smoker region charges
0 19 female 27.900 0 1 southwest 16885.0
1 18 male 33.770 1 0 southeast 1726.0
2 28 male 33.000 3 0 southeast 4449.0
3 33 male 22.705 0 0 northwest 21984.0
4 32 male 28.880 0 0 northwest 3867.0

Preprocess Data¶

  • In this step, we will go through the process of preprocessing data for this task. In this notebook, the steps will be condensed to save space. For an in-depth look at preprocessing, see the separate preprocessing notebook from this repo.
In [3]:
#Define X and y variables
y = df['charges']
X = df.drop(columns = 'charges')
In [4]:
#Perform validation split
#Setting a random state will make this reproducible in the future
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Verify the split is correct
X_train.head()  #Note the absence of the charges column from the X_train data
Out[4]:
age sex bmi children smoker region
693 24 male 23.655 0 0 northwest
1297 28 female 26.510 2 0 southeast
634 51 male 39.700 1 0 southwest
1022 47 male 36.080 1 1 southeast
178 46 female 28.900 2 0 southwest
In [5]:
##Create numeric pipeline
#Define numeric columns
num_cols = X_train.select_dtypes('number').columns

#Instantiate transformers
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

#Set numeric pipeline
num_pipe = make_pipeline(impute_mean, scaler)

#Create tuple for column transformer
num_tuple = ("Numeric", num_pipe, num_cols)
In [6]:
##Create categorical pipeline
#Define categorical columns
cat_cols = X_train.select_dtypes('object').columns

#Instantiate transformers
impute_missing = SimpleImputer(strategy='constant', fill_value='Missing')
cat_encode = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

#Set categorical pipeline
cat_pipe = make_pipeline(impute_missing, cat_encode)

#Create tuple for column transformer
cat_tuple = ("Categorical", cat_pipe, cat_cols)
In [7]:
#Finalize preprocessing object
preprocessor = ColumnTransformer([num_tuple, cat_tuple], verbose_feature_names_out=False)
In [8]:
#Fit preprocessor on training data
preprocessor.fit(X_train)
Out[8]:
ColumnTransformer(transformers=[('Numeric',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer()),
                                                 ('standardscaler',
                                                  StandardScaler())]),
                                 Index(['age', 'bmi', 'children', 'smoker'], dtype='object')),
                                ('Categorical',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='Missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 Index(['sex', 'region'], dtype='object'))],
                  verbose_feature_names_out=False)
In [9]:
#Transform training and testing data
X_train_tf = preprocessor.transform(X_train)
X_test_tf = preprocessor.transform(X_test)

#View preprocessed training data
X_train_tf.head()
Out[9]:
age bmi children smoker sex_female sex_male region_northeast region_northwest region_southeast region_southwest
693 -1.087167 -1.140875 -0.917500 -0.508399 0.0 1.0 0.0 1.0 0.0 0.0
1297 -0.802106 -0.665842 0.743605 -0.508399 1.0 0.0 0.0 0.0 1.0 0.0
634 0.836992 1.528794 -0.086947 -0.508399 0.0 1.0 0.0 0.0 0.0 1.0
1022 0.551932 0.926476 -0.086947 1.966960 0.0 1.0 0.0 0.0 1.0 0.0
178 0.480667 -0.268178 0.743605 -0.508399 1.0 0.0 0.0 0.0 0.0 1.0

Model 1: Linear Regression¶

  • Now that the data has been preprocessed, we will fit our first model, Linear Regression.
  • Linear Regression is a simple and easily interpreted model that chooses the intercept and coefficients that minimize the sum of squared errors between its predictions and the actual values. With a single feature, the result is a straight line described by the classic formula y = mx + b. With several features, as in this data set, each feature gets its own coefficient: y = b + m1x1 + m2x2 + ... + mnxn.
  • After the model is fit, we can view the intercept (b) and the coefficient produced for each feature in the X data. The model combines the intercept and all of the coefficients to make each prediction, which is known as multiple regression.
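
  • To make the idea of minimizing squared errors concrete, here is a minimal sketch that solves a least squares problem directly with NumPy. The data is made up, with a known intercept of 5 and coefficients of 3 and -2, and the names X_demo, y_demo, and X_design are hypothetical. This is an illustration of the general technique, not a description of how scikit-learn implements it internally.

```python
import numpy as np

# Made-up data: 2 features with known coefficients (3 and -2) and intercept (5)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 5 + 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.1, 100)

# Add a column of ones so the intercept is estimated along with the coefficients
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: find the weights that minimize the sum of squared errors
weights, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept, coefs = weights[0], weights[1:]

# A prediction is the intercept plus each feature times its coefficient
y_hat = intercept + X_demo @ coefs

print("Intercept:", round(float(intercept), 2))
print("Coefficients:", np.round(coefs, 2))
```

  • The recovered intercept and coefficients should land close to the values used to generate the data, because the added noise is small.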
In [10]:
#Instantiate the model
lin_reg = LinearRegression()
In [11]:
#Fit the model onto the training data
lin_reg.fit(X_train_tf, y_train)
Out[11]:
LinearRegression()
In [12]:
#View the intercept of the line
lin_reg.intercept_
Out[12]:
3.75043980544144e+16
In [13]:
#Determine the coefficients from the regression
lin_reg.coef_
Out[13]:
array([ 3.64288226e+03,  2.04174591e+03,  5.13908932e+02,  9.54795339e+03,
       -3.90284646e+16, -3.90284646e+16,  1.52406651e+15,  1.52406651e+15,
        1.52406651e+15,  1.52406651e+15])
In [14]:
#Get predictions for training data
y_pred_train = lin_reg.predict(X_train_tf)

#Get predictions for testing data
y_pred_test = lin_reg.predict(X_test_tf)
In [15]:
#View results of the testing predictions in a new data frame
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_test, 
                        'Error': y_pred_test - y_test})
results
Out[15]:
Actual Predicted Error
764 9095.0 8960.0 -135.0
887 5272.0 7064.0 1792.0
890 29331.0 36904.0 7573.0
1293 9302.0 9488.0 186.0
259 33750.0 26928.0 -6822.0
... ... ... ...
342 13217.0 12808.0 -409.0
308 11945.0 14768.0 2823.0
1128 14358.0 7424.0 -6934.0
503 32548.0 25936.0 -6612.0
1197 5700.0 9136.0 3436.0

335 rows × 3 columns

  • In the error column, the values range from fairly small (in the hundreds of dollars) to fairly large (in the thousands), and they are both positive (over-predictions) and negative (under-predictions). Individual errors alone will not tell us how well the model predicted overall, so we need to evaluate the model using summary metrics.

Different Metrics Explained¶

  • There are 4 common metrics that can be used to evaluate the model:
    • Mean Absolute Error (MAE): When measuring the mean absolute error, the absolute value of every error value is added together and then divided by the total number of errors, giving the mean. A lower MAE value equates to a better performing model. This is an intuitive metric to understand and explain to stakeholders.
    • Mean Squared Error (MSE): Mean squared error is calculated the same way as MAE, except that each error is squared instead of taking its absolute value. Squaring penalizes the model more heavily for larger errors, but it gives values in squared units rather than the units of the problem, which makes the metric less intuitive to interpret and explain. As with MAE, a lower MSE equates to a better performing model.
    • Root Mean Squared Error (RMSE): Root mean squared error is the result of taking the square root of the MSE value. This allows for penalization of large errors that the model makes, but returns values in the same units as the original problem. RMSE is easier to explain to stakeholders than MSE. Lower RMSE scores are better.
    • R-Squared (R2): The R2 score tells us the proportion of the variance in the target that the model explains. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the target. Usually, the higher the R2 the better. This metric is less intuitive than the others, especially for a non-technical audience.
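
  • The four metrics above can be worked through by hand. The sketch below uses plain Python and four made-up actual and predicted values (not taken from the insurance data) so that each step of the arithmetic is visible.

```python
import math

# Made-up actual and predicted values, in dollars
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 400.0]

# Error for each row: predicted minus actual
errors = [p - a for a, p in zip(actual, predicted)]
n = len(errors)

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / n

# MSE: mean of the squared errors (squared dollars)
mse = sum(e ** 2 for e in errors) / n

# RMSE: square root of MSE, back in dollars
rmse = math.sqrt(mse)

# R2: 1 minus (squared error of the model / squared error of predicting the mean)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2: {r2:.3f}")
```

  • Note that RMSE comes out higher than MAE here because the single 30 dollar error is weighted more heavily once it is squared.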
In [16]:
##Use code to compute the evaluation metrics
#Calculating MAE
train_MAE = mean_absolute_error(y_train, y_pred_train)
test_MAE = mean_absolute_error(y_test, y_pred_test)
print(f'Model Training MAE: {train_MAE:,.2f}')
print(f'Model Testing MAE: {test_MAE:,.2f}')
Model Training MAE: 4,180.30
Model Testing MAE: 4,239.17
In [17]:
#Calculating MSE
train_MSE = mean_squared_error(y_train, y_pred_train)
test_MSE = mean_squared_error(y_test, y_pred_test)
print(f'Model Training MSE: {train_MSE:,.2f}')
print(f'Model Testing MSE: {test_MSE:,.2f}')
Model Training MSE: 37,004,850.50
Model Testing MSE: 35,103,353.69
In [18]:
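#Calculating RMSE
#Taking the square root of MSE works across scikit-learn versions
#(the squared=False argument is deprecated in newer releases)
train_RMSE = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_RMSE = np.sqrt(mean_squared_error(y_test, y_pred_test))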
print(f'Model Training RMSE: {train_RMSE:,.2f}')
print(f'Model Testing RMSE: {test_RMSE:,.2f}')
Model Training RMSE: 6,083.16
Model Testing RMSE: 5,924.81
In [19]:
#Calculating R2
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
print(f'Model Training R2: {train_r2:.2f}')
print(f'Model Testing R2: {test_r2:.2f}')
Model Training R2: 0.74
Model Testing R2: 0.77
In [20]:
#View training metrics altogether
print(f'Model Training MAE: {train_MAE:,.2f}')
print(f'Model Training MSE: {train_MSE:,.2f}')
print(f'Model Training RMSE: {train_RMSE:,.2f}')
print(f'Model Training R2: {train_r2:.2f}')
Model Training MAE: 4,180.30
Model Training MSE: 37,004,850.50
Model Training RMSE: 6,083.16
Model Training R2: 0.74
In [21]:
#View testing metrics altogether
print(f'Model Testing MAE: {test_MAE:,.2f}')
print(f'Model Testing MSE: {test_MSE:,.2f}')
print(f'Model Testing RMSE: {test_RMSE:,.2f}')
print(f'Model Testing R2: {test_r2:.2f}')
Model Testing MAE: 4,239.17
Model Testing MSE: 35,103,353.69
Model Testing RMSE: 5,924.81
Model Testing R2: 0.77

Linear Regression Metrics Interpreted¶

  • For this basic linear regression model, we can interpret the metrics used above like so:
    • The model returned a testing MAE of \$4,239.17. This means the model's predictions were off by approximately 4 thousand dollars on average.
    • The testing MSE is 35,103,353.69 squared dollars. Because MSE is in squared units, its size is hard to judge on its own, but squaring means that large errors contribute heavily to it.
    • The testing RMSE is \$5,924.81, or roughly 6 thousand dollars. Because RMSE is noticeably higher than MAE, the model is making some large errors on individual predictions.
    • With a testing R2 score of 0.77, this model explains about 77% of the variance in the insurance charges. This score is not too bad!

Model 2: Random Forest¶

  • Our second regression model will be a random forest. This is known as an ensemble algorithm: it trains many individual decision trees, each on a random sample of the training data, and averages their predictions in order to make more accurate predictions than a single tree.
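
  • As a minimal sketch of the ensemble idea, the code below fits a small forest on made-up data and checks that the forest's prediction matches the average of its individual trees' predictions. The names X_demo, y_demo, and forest are hypothetical, and n_estimators=25 is chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data with a non-linear relationship between features and target
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 5 * X_demo[:, 1] + rng.normal(0, 1, 200)

# Setting random_state makes the forest reproducible
forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is stored in the estimators_ attribute
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging the trees by hand should match the forest's own prediction
manual_average = tree_preds.mean(axis=0)
forest_preds = forest.predict(X_demo[:5])

print("Trees in forest:", len(forest.estimators_))
print("Matches forest prediction:", np.allclose(manual_average, forest_preds))
```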
In [22]:
#Instantiate the model
random_forest = RandomForestRegressor()
In [23]:
#Fit the model on the training data
random_forest.fit(X_train_tf, y_train)
Out[23]:
RandomForestRegressor()
In [24]:
#Get predictions for training data
y_pred_train = random_forest.predict(X_train_tf)

#Get predictions for testing data
y_pred_test = random_forest.predict(X_test_tf)
In [25]:
#Evaluate model using MAE
train_MAE = mean_absolute_error(y_train, y_pred_train)
test_MAE = mean_absolute_error(y_test, y_pred_test)
print(f'Model Training MAE: {train_MAE:,.2f}')
print(f'Model Testing MAE: {test_MAE:,.2f}')
Model Training MAE: 1,021.42
Model Testing MAE: 2,596.28
In [26]:
#Evaluate model using MSE
train_MSE = mean_squared_error(y_train, y_pred_train)
test_MSE = mean_squared_error(y_test, y_pred_test)
print(f'Model Training MSE: {train_MSE:,.2f}')
print(f'Model Testing MSE: {test_MSE:,.2f}')
Model Training MSE: 3,506,607.41
Model Testing MSE: 22,616,773.35
In [27]:
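#Evaluate model using RMSE
#Taking the square root of MSE works across scikit-learn versions
#(the squared=False argument is deprecated in newer releases)
train_RMSE = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_RMSE = np.sqrt(mean_squared_error(y_test, y_pred_test))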
print(f'Model Training RMSE: {train_RMSE:,.2f}')
print(f'Model Testing RMSE: {test_RMSE:,.2f}')
Model Training RMSE: 1,872.59
Model Testing RMSE: 4,755.71
In [28]:
#Evaluate model using R2
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
print(f'Model Training R2: {train_r2:.2f}')
print(f'Model Testing R2: {test_r2:.2f}')
Model Training R2: 0.98
Model Testing R2: 0.85

Random Forest Metrics Interpreted¶

  • The MAE of the random forest was \$2,596.28, which is a lower score than our linear regression model.
  • The MSE was 22,616,773.35 squared dollars, which again is lower than the linear regression model. This still shows some potentially large errors in the model.
  • The RMSE is \$4,755.71, again coming in lower than the linear regression model.
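  • Finally, the testing R2 for this model is 0.85, which means it explains about 85% of the variance in the insurance charges.
  • Overall, the random forest model performed better than the linear regression model on the testing data. Note, however, that its training scores (R2 of 0.98) are much better than its testing scores (R2 of 0.85), which suggests the model is overfitting the training data somewhat.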
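
  • Because the same four metrics are calculated for every model and every split, the repeated cells could be wrapped in a small helper. The function below, evaluate_regression, is a hypothetical name introduced here for illustration and is not part of scikit-learn. It is called on made-up values only to show the usage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_regression(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # square root puts the error back in dollars
        "R2": r2_score(y_true, y_pred),
    }


# Made-up values purely to show the call
demo_scores = evaluate_regression([100, 200, 300, 400], [110, 190, 330, 400])
for name, value in demo_scores.items():
    print(f"{name}: {value:,.2f}")
```

  • In this notebook, the helper could be called once with the training data and once with the testing data for each model, which makes it easy to compare the two splits side by side.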

Conclusions¶

  • In this notebook, we looked at how to build, run, and interpret 2 types of regression machine learning algorithms. After running both models, we compared the evaluations of both models to determine which model had better predictive power for this data set.
    • We imported new packages for regression algorithms and metrics.
    • We quickly preprocessed the data for machine learning.
    • We built a linear regression model and evaluated the model using 4 different metrics.
    • We did the same with a random forest model, and compared the 2 models to determine that a random forest model is more accurate at predicting the target variable of this data set.