Build a Linear Regression Model

Jishnu Prasad Samal

Sunday, 09 April, 2023 6 Minutes Read

Linear Regression is one of the oldest and most widely used Machine Learning algorithms. In its simplest form, it models the relationship between two variables: an Independent Variable and a Dependent (Target) Variable. If you wish to learn more about AI and Machine Learning, you may see my blog on Artificial Intelligence.

In this project, I will be training a model to predict Sports Sustainability. Sustainability in sports means conducting a sporting event using environmentally friendly methods that reduce its negative impact on the environment. Just like every industry, sports has a supply chain issue. When enjoying sports like Cricket or Football, we tend to forget about the environment.

But we need to think beyond the tournament: who is making players’ kits and boots? Where is the water feeding the pitch coming from? How was the stadium built, and how is it maintained? What’s the impact of major tournaments like the Champions League or the Olympics, where hastily erected stadiums and hundreds of thousands of fans take over local areas? Moreover, are certain actions – like disposable cups in stadiums, or the use of recycled fibres in kits – simply a sticking plaster over much wider issues across the entire supply chain?

For this project, we have a dataset containing the number of suppliers of sports goods and the corresponding carbon emissions from them (in metric tons). I am going to use Scikit-learn for training and evaluating the model. Before building the model, we need to check the data for errors or ambiguities and identify its trends. I will be using Pandas and NumPy for data analysis and Matplotlib for plotting graphs and charts. So, without wasting any further time, let's jump right in.

Bootstrapping

First, let's install and import the required packages as shown in the codeblock below.

%pip install pandas numpy matplotlib scikit-learn joblib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data Analysis with Pandas

With the project dependencies installed, let's move forward and import our dataset using pandas. I create the dataframe by importing the dataset with the pd.read_csv() function and then drop the year column, as it does not provide any relevant information for our model. This is an example dataset with only 16 data points; production models are trained on much larger datasets containing hundreds or thousands of data points, or more.
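The loading step described above can be sketched as follows. This is only an illustration: the original CSV file is not shown here, so the inline text, the column names (including the name "Year") and all values are made-up stand-ins, not the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file.
# Column names and values are invented for illustration only.
csv_text = """Year,Number of Suppliers,Carbon Emissions from Suppliers (metric tons)
2019,10,57
2020,20,78
2021,30,107
2022,40,128
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Drop the year column, which is not used as a feature
df = df.drop(columns=["Year"])

print(df.head())
```

With a real file, the io.StringIO object would simply be replaced by the path to the CSV.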

Then, we will get the columns using df.columns and the shape of the dataset using df.shape.

Then, we will get information about the dataset using df.info(). It lists the columns of the dataset, the non-null count (the number of values in each column that are not null) and the data type of the values in each column.

Then, I am going to look at the count, mean, standard deviation, min, max and several percentiles of the data using df.describe(). The 50% percentile it reports is the median.
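The inspection calls mentioned above could look roughly like this. The dataframe here is a made-up stand-in with 16 rows (the column names are taken from the feature names used in this post, and the values are invented), so the numbers it prints are not the ones from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data: 16 points following a roughly linear trend.
# These values are for illustration only, not the real dataset.
suppliers = np.arange(10, 170, 10)
emissions = 2.5 * suppliers + 30 + np.tile([2.0, -2.0], 8)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

print(df.columns)  # column labels
print(df.shape)    # (rows, columns)

df.info()          # prints non-null counts and dtypes per column

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```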

Now, let's draw a scatter plot of the data points using pandas.
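A scatter plot like this can be drawn with the pandas plotting interface, which uses Matplotlib under the hood. The sketch below uses made-up stand-in data and a title I chose for illustration, so the resulting chart will not match the real one.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in data (illustration only, not the real dataset)
suppliers = np.arange(10, 170, 10)
emissions = 2.5 * suppliers + 30 + np.tile([2.0, -2.0], 8)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

# DataFrame.plot.scatter returns the Matplotlib Axes it draws on
ax = df.plot.scatter(
    x="Number of Suppliers",
    y="Carbon Emissions from Suppliers (metric tons)",
    title="Suppliers vs Carbon Emissions",  # hypothetical title
)
ax.set_xlabel("Number of Suppliers")
ax.set_ylabel("Carbon Emissions from Suppliers (metric tons)")
plt.show()
```

If the points fall close to a straight line, as they do in this stand-in data, a linear model is a reasonable first choice.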

Model Training

Now that we have finished analyzing the data, we can begin with model training. Before we start, we are going to import the required modules.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import joblib

Now, we need to initialize two variables - x which is our independent variable and Y which is our dependent or target variable. x contains the Number of Suppliers feature and Y contains the Carbon Emissions from Suppliers (metric tons) which we need to predict.

Now, we will split our dataset into training data and testing data. In this step, we convert x to a numpy array and then split both variables: x_train and y_train hold the training data taken from x and Y respectively, while x_test and y_test hold the testing data. The size of the test set is 0.2, that is, approximately 20% of the original dataset.
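Defining x and Y and splitting them might be sketched like this. The data is a made-up stand-in, and random_state=42 is my own assumption, added only so that the split is reproducible. Note that scikit-learn expects the features as a 2D array, hence the reshape.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data (illustration only, not the real dataset)
suppliers = np.arange(10, 170, 10)
emissions = 2.5 * suppliers + 30 + np.tile([2.0, -2.0], 8)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

# Independent variable as a 2D array of shape (n_samples, 1)
x = df["Number of Suppliers"].to_numpy().reshape(-1, 1)
# Dependent (target) variable as a 1D array
Y = df["Carbon Emissions from Suppliers (metric tons)"].to_numpy()

# Hold out 20% of the rows for testing.
# random_state is an assumption, used here for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

With 16 rows and test_size=0.2, scikit-learn rounds the test set up to 4 rows and leaves 12 for training.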

Now, finally, it's time to fit the model to the data. To do so, we first create an instance of the LinearRegression class imported from scikit-learn and assign it to a model variable. Then we call model.fit() on the training data, and our model is ready to use.

Now, let's make predictions using the model built in the previous step. Here, we need to pass the Number of Suppliers to the model.predict() function as a 2D numpy array with one row per sample. The output is the predicted Carbon Emission in metric tons.
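Fitting and predicting could be sketched as below. The data is again a made-up stand-in built from the relation 2.5 * suppliers + 30 plus a small offset, the random_state is my assumption, and the value of 100 suppliers is just an example input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in data (illustration only, not the real dataset)
x = np.arange(10, 170, 10).reshape(-1, 1)
Y = 2.5 * x.ravel() + 30 + np.tile([2.0, -2.0], 8)

# random_state is an assumption, used for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42
)

# Create the model and fit it on the training data only
model = LinearRegression()
model.fit(x_train, y_train)

# predict() expects a 2D array: one row per sample, one column per feature
prediction = model.predict(np.array([[100]]))
print(prediction)
```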

Model Evaluation

We have successfully built our model in the previous step. Now we need to evaluate its accuracy and performance, which is a crucial step in the Machine Learning lifecycle. In the model training stage, we created x_test and y_test, and we will now use those two to test the model.

I will be creating an array named y_pred which will contain predicted values of data points of x_test.

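We will be looking at the score, intercept, coefficient and R² score. For LinearRegression, model.score() returns the coefficient of determination (R²) on the data passed to it, so it measures the same thing as r2_score. The score of the model is 99.671% at the time of publishing this blog, which means the fitted line explains almost all of the variation in the data. Keep in mind that with only 16 data points the test set contains just a few samples, so this figure should be read with some caution.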
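The evaluation metrics mentioned above can be computed roughly as follows. Since the data is a made-up stand-in and the random_state is my assumption, the numbers printed here will differ from the 99.671% reported for the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up stand-in data (illustration only, not the real dataset)
x = np.arange(10, 170, 10).reshape(-1, 1)
Y = 2.5 * x.ravel() + 30 + np.tile([2.0, -2.0], 8)

x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42  # random_state is an assumption
)
model = LinearRegression().fit(x_train, y_train)

# Predicted values for the held-out test points
y_pred = model.predict(x_test)

score = model.score(x_test, y_test)  # R² on the test set
r2 = r2_score(y_test, y_pred)        # same quantity, computed from y_pred
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
print("Score (R²):", score)
print("MSE:", mse, "MAE:", mae)
```

The mean squared error and mean absolute error are in squared and plain target units respectively, which makes them a useful complement to the unitless R².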

Now, I am going to compare the values of y_test, which are the actual values of the data points, with the values of y_pred, which are the predicted values. I will do this by plotting a graph of Actual vs Predicted values.
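One way to draw such a comparison is to plot both sets of values against the test inputs. This is a sketch with made-up stand-in data, and the marker styles and title are my own choices, so it is not necessarily how the original chart was produced.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in data (illustration only, not the real dataset)
x = np.arange(10, 170, 10).reshape(-1, 1)
Y = 2.5 * x.ravel() + 30 + np.tile([2.0, -2.0], 8)

x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42  # random_state is an assumption
)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Plot actual and predicted emissions for the same test inputs
fig, ax = plt.subplots()
ax.scatter(x_test.ravel(), y_test, label="Actual", marker="o")
ax.scatter(x_test.ravel(), y_pred, label="Predicted", marker="x")
ax.set_xlabel("Number of Suppliers")
ax.set_ylabel("Carbon Emissions from Suppliers (metric tons)")
ax.set_title("Actual vs Predicted")
ax.legend()
plt.show()
```

The closer each cross sits to its circle, the smaller the prediction error for that test point.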

Saving the Model

Now, we need to save our trained model for future use. We will pickle the model using the joblib package.

def save_model(model):
    # Passing a file path lets joblib open and close the file itself
    joblib.dump(model, 'model.jlib')

save_model(model)

Now, we have saved our model. Let's try out the saved model by loading it as saved_model and then make predictions using the saved model.

saved_model = joblib.load('model.jlib')

As we can see above, the saved model gives the same output as the original model.
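The save and load round trip can be checked with a small sketch like the one below. It uses made-up stand-in data, fits on all of it for brevity, and writes to a temporary directory instead of the working directory; the input of 100 suppliers is just an example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-in data (illustration only, not the real dataset)
x = np.arange(10, 170, 10).reshape(-1, 1)
Y = 2.5 * x.ravel() + 30 + np.tile([2.0, -2.0], 8)
model = LinearRegression().fit(x, Y)

# Save the model to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.jlib")
    joblib.dump(model, path)
    saved_model = joblib.load(path)

# Both models should give the same prediction for the same input
sample = np.array([[100]])
original_pred = model.predict(sample)
saved_pred = saved_model.predict(sample)
print(original_pred, saved_pred)
```

Files written by joblib are pickles, so they should only be loaded from sources you trust.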

Final Thoughts

In this blog, I demonstrated how to build a Linear Regression model to predict sustainability in sports. We used Pandas and Matplotlib to analyze the data and then used Scikit-learn to train the model. After Model Evaluation, the last step is Model Deployment. I will demonstrate deploying ML models on Hugging Face Spaces using the Gradio framework in a separate blog. I have already deployed this model, so if you want to try it out, you may check out the link below.

Deployed Model - https://jishnupsamal-sports-sustainability.hf.space