Build a Linear Regression Model
Jishnu Prasad Samal
Linear Regression is one of the oldest and most widely used Machine Learning algorithms. It trains a model on the relationship between two variables: an Independent Variable and a Dependent (Target) Variable. If you wish to learn more about AI and Machine Learning, you may see my blog on Artificial Intelligence.
In this project, I will be training a model to predict Sports Sustainability. Sustainability in sports means conducting a sporting event using environmentally friendly methods that reduce its negative impact on the environment. Just like every industry, sports has a supply chain issue. When enjoying sports like Cricket or Football, we tend to forget about the environment.
But we need to think beyond the tournament: who is making players’ kits and boots? Where is the water feeding the pitch coming from? How was the stadium built, and how is it maintained? What’s the impact of major tournaments like the Champions’ League or the Olympics, where hastily erected stadiums and hundreds of thousands of fans take over local areas? Moreover, are certain actions – like disposable cups in stadiums, or the use of recycled fibres in kits – simply a sticking plaster over much wider issues across the entire supply chain?
For this project, we have a dataset containing the number of suppliers of sports goods and the corresponding carbon emissions from them (in metric tons). I am going to use Scikit-learn for training and evaluating the model. Before building the model, we need to analyse the data for any errors or ambiguities and identify the trends in it. I will be using Pandas and Numpy for data analysis and Matplotlib for plotting graphs and charts. So, without wasting any further time, let's jump right in.
Bootstrapping
First, let's install and import the required packages as shown in the codeblock below.
%pip install pandas numpy matplotlib scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data Analysis with Pandas
With the project dependencies installed, let's move forward and import our dataset using pandas. In the first line, I have created the dataframe by importing the dataset with the pd.read_csv() function, and in the second line, I dropped the year column as it does not provide any relevant information for our model. This is an example dataset and has only 16 data points, but production models are trained on much larger datasets containing hundreds or thousands of data points.
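The two lines described above might look like the following minimal sketch. The original code cell is not shown here, so the year column name ("Year") and the four made-up rows are assumptions for illustration, and the CSV is read from an in-memory string instead of a real file.

```python
import io

import pandas as pd

# Synthetic stand-in for the real CSV file; these values are made up.
csv_text = """Year,Number of Suppliers,Carbon Emissions from Suppliers (metric tons)
2018,10,71
2019,15,94
2020,20,121
2021,25,146
"""

# With a real file this would be pd.read_csv("<path to your csv>").
df = pd.read_csv(io.StringIO(csv_text))

# Drop the year column (assumed here to be named "Year").
df = df.drop(columns=["Year"])

print(df.head())
```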
Then, we will get the columns using df.columns and the shape of the dataset using df.shape.
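As a rough illustration of these two attributes, the sketch below builds a synthetic 16-row dataframe (made-up values, not the blog's dataset) with column names assumed to match the features described in this post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

print(df.columns)  # the column labels
print(df.shape)    # a (rows, columns) tuple
```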
Then, we will get information about the dataset using df.info(). It lists the columns of the dataset, the non-null count (the number of values in each column that are not null) and the data type of the values in each column.
Then, I am going to look at the count, mean, standard deviation, min, max and several percentiles of the data (including the median, which is the 50th percentile) using df.describe().
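The two inspection calls, df.info() and df.describe(), could be tried out as in the sketch below. It again uses synthetic stand-in data with assumed column names, so the printed numbers will not match the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

# Column names, non-null counts and dtypes.
df.info()

# count, mean, std, min, 25%, 50% (median), 75% and max per column.
summary = df.describe()
print(summary)
```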
Now, let's draw a scatter plot of the data points using pandas.
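One way to draw such a scatter plot with pandas is sketched below, on synthetic stand-in data. The plot title and output file name are assumptions, and the Agg backend line is only there so the sketch also runs as a plain script; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import numpy as np
import pandas as pd

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

# pandas delegates the drawing to Matplotlib and returns the Axes object.
ax = df.plot.scatter(
    x="Number of Suppliers",
    y="Carbon Emissions from Suppliers (metric tons)",
    title="Suppliers vs Carbon Emissions",
)
ax.figure.savefig("suppliers_vs_emissions.png")
```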
Model Training
Now that we have finished analyzing the data, we can begin with model training. Before we start, we are going to import the required modules.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import joblib
Now, we need to initialize two variables: x, which is our independent variable, and Y, which is our dependent or target variable. x contains the Number of Suppliers feature and Y contains the Carbon Emissions from Suppliers (metric tons), which we need to predict.
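Selecting the two variables could look like the sketch below. It assumes the feature names above are the exact column names of the dataframe, and it uses synthetic stand-in data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})

# Independent variable (feature) and dependent variable (target).
x = df["Number of Suppliers"]
Y = df["Carbon Emissions from Suppliers (metric tons)"]

print(x.head())
print(Y.head())
```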
Now, we will split our dataset into training data and testing data. In the code block below, we are converting x to a numpy array and then assigning one part to x_train, which is our training data from the x variable, and another part to y_train, which is our training data from the Y variable. Similarly, we are creating x_test and y_test, which hold our testing data. The size of the testing set is 0.2, or approximately 20%, of the original dataset.
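A minimal sketch of this split on synthetic stand-in data is shown below. The reshape to a single-column 2D array and the random_state value are assumptions; scikit-learn estimators expect features shaped (n_samples, n_features), and a fixed seed only makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})
x = df["Number of Suppliers"]
Y = df["Carbon Emissions from Suppliers (metric tons)"]

# Convert x to a numpy array with one column per feature.
x_array = x.to_numpy().reshape(-1, 1)

# Hold out 20% of the rows for testing (random_state is an assumed seed).
x_train, x_test, y_train, y_test = train_test_split(
    x_array, Y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

With 16 rows, a test size of 0.2 cannot be matched exactly: scikit-learn rounds the test set up to 4 rows and leaves 12 for training, which is why the split is only approximately 20%.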
Now, finally, it's time to fit the data into the model. To do so, we first need to initialize the LinearRegression() class imported from scikit-learn and assign it to a model variable. Then we use model.fit() to fit the data, and our model is ready to use.
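The fitting step can be sketched as follows, again on synthetic stand-in data that was generated with a slope of 5 and an intercept of 20, so the learned parameters should land close to those values. The random_state seed is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): emissions ~ 5 * suppliers + 20 + noise.
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
df = pd.DataFrame({
    "Number of Suppliers": suppliers,
    "Carbon Emissions from Suppliers (metric tons)": emissions,
})
x_array = df["Number of Suppliers"].to_numpy().reshape(-1, 1)
Y = df["Carbon Emissions from Suppliers (metric tons)"]
x_train, x_test, y_train, y_test = train_test_split(
    x_array, Y, test_size=0.2, random_state=42
)

# Initialize the estimator and fit it on the training data only.
model = LinearRegression()
model.fit(x_train, y_train)

print("coefficient:", model.coef_[0])
print("intercept:", model.intercept_)
```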
Now, let's make predictions using the model built in the previous step. Here, we need to pass the Number of Suppliers as a numpy array into the model.predict() function, and we get the Carbon Emission in metric tons as our output.
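A prediction for a single new value might look like the sketch below. The input of 50 suppliers is an arbitrary example, the data is synthetic, and for brevity the model here is fit on all rows instead of a training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (made up): emissions ~ 5 * suppliers + 20 + noise.
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)

# Fit on all rows for brevity (the post fits on the training split).
model = LinearRegression()
model.fit(suppliers.reshape(-1, 1), emissions)

# predict() expects a 2D array: one row per sample, one column per feature.
new_suppliers = np.array([[50]])
predicted = model.predict(new_suppliers)

print("Predicted emissions (metric tons):", predicted[0])
```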
Model Evaluation
We have successfully built our model in the previous step. Now we need to evaluate its accuracy and performance, which is a crucial step in the Machine Learning lifecycle. In the model training stage, we created x_test and y_test, and now we will be using those two to test the model.
I will be creating an array named y_pred, which will contain the predicted values for the data points of x_test.
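We will look at the model's score, intercept, coefficient and R² score. The score of the model is 99.671% at the time of publishing this blog, which is quite good. The R² score is also quite good. Note that for LinearRegression, model.score() itself returns R² for the data passed to it, so the two figures match when both are computed on the test set.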
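The evaluation step could be sketched as below, on synthetic stand-in data, so the numbers printed will differ from the 99.671% reported above. The random_state seed is an assumption; the mean squared error and mean absolute error come from the metrics imported earlier in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): emissions ~ 5 * suppliers + 20 + noise.
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)

x_train, x_test, y_train, y_test = train_test_split(
    suppliers.reshape(-1, 1), emissions, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)

# Predicted values for the held-out test points.
y_pred = model.predict(x_test)

score = model.score(x_test, y_test)        # R^2 computed by the estimator
r2 = r2_score(y_test, y_pred)              # R^2 computed from predictions
mse = mean_squared_error(y_test, y_pred)   # average squared error
mae = mean_absolute_error(y_test, y_pred)  # average absolute error

print("score:", score)
print("intercept:", model.intercept_)
print("coefficient:", model.coef_[0])
print("R2:", r2, "MSE:", mse, "MAE:", mae)
```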
Now, I am going to compare the values of y_test, which are the actual values of the data points, with the values of y_pred, which are the predicted values. I will do this by plotting a graph of Actual vs Predicted values.
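One possible way to draw this comparison with Matplotlib is sketched below on synthetic stand-in data. The marker styles, labels and output file name are assumptions, and the Agg backend line is only there so the sketch runs as a plain script.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): emissions ~ 5 * suppliers + 20 + noise.
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)

x_train, x_test, y_train, y_test = train_test_split(
    suppliers.reshape(-1, 1), emissions, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Plot actual and predicted emissions against the same test inputs.
fig, ax = plt.subplots()
ax.scatter(x_test[:, 0], y_test, label="Actual")
ax.scatter(x_test[:, 0], y_pred, marker="x", label="Predicted")
ax.set_xlabel("Number of Suppliers")
ax.set_ylabel("Carbon Emissions from Suppliers (metric tons)")
ax.set_title("Actual vs Predicted")
ax.legend()
fig.savefig("actual_vs_predicted.png")
```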
Saving the Model
Now, we need to save our trained model for future use. We will pickle the model using the joblib package.
def save_model(model):
    # joblib.dump accepts a file path directly, so no file handle is left open
    joblib.dump(model, 'model.jlib')

save_model(model)
Now, we have saved our model. Let's try out the saved model by loading it as saved_model and then making predictions with it.
saved_model = joblib.load('model.jlib')
As we can see above, the saved model gives the same output as the original model.
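The save-and-load round trip can be checked with a sketch like the one below. It uses synthetic stand-in data, fits on all rows for brevity, and writes model.jlib into a temporary directory (an assumption made here so that the example leaves no files behind).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (made up): emissions ~ 5 * suppliers + 20 + noise.
rng = np.random.default_rng(0)
suppliers = np.arange(10, 90, 5)  # 16 values: 10, 15, ..., 85
emissions = 5.0 * suppliers + 20.0 + rng.normal(0, 3, size=suppliers.size)
model = LinearRegression().fit(suppliers.reshape(-1, 1), emissions)

# Save the model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.jlib")
    joblib.dump(model, path)
    saved_model = joblib.load(path)

# Both models should give the same prediction for the same input.
sample = np.array([[50]])
original_prediction = model.predict(sample)
saved_prediction = saved_model.predict(sample)
print(np.allclose(original_prediction, saved_prediction))
```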
Final Thoughts
In this blog, I demonstrated how to build a Linear Regression model to predict sustainability in sports. We used Pandas and Matplotlib to analyze the data and then used Scikit-learn to train the model. After Model Evaluation, the last step is Model Deployment. I will demonstrate deploying ML models on Hugging Face Spaces using the Gradio framework in another blog post. I have already deployed this model, so if you want to try it out, you may check out the link below.
Deployed Model - https://jishnupsamal-sports-sustainability.hf.space