🧠 Introduction
📈 What is Linear Regression?
Linear regression is a supervised machine learning algorithm used for predicting a continuous value. The model fits a straight line through the data that best represents the relationship between the input variables (X) and the output variable (Y).
🗣️ Example: Predicting house prices based on area, number of bedrooms, etc.
🌍 Real-world Applications of Linear Regression
- Finance: Predicting stock prices or credit risk.
- Healthcare: Estimating recovery time from treatment.
- Marketing: Forecasting sales based on ad spend.
- Education: Predicting student performance.
🧪 Theory Behind Linear Regression
📐 Line of Best Fit
We represent the regression line as:

Y = β₀ + β₁X + ε

Where:

- β₀ is the intercept,
- β₁ is the slope,
- ε is the error term.
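To make the formula concrete, here is a minimal sketch that estimates the slope and intercept with the classic least-squares formulas (the area and price numbers are purely illustrative):

import numpy as np

# Toy data: area vs price (illustrative values only)
X = np.array([1000, 1500, 2000, 2500, 3000])
y = np.array([200000, 280000, 370000, 450000, 540000])

# Least-squares estimates:
#   slope     β₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
#   intercept β₀ = ȳ - β₁x̄
x_mean, y_mean = X.mean(), y.mean()
b1 = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

print(f"Intercept (β₀): {b0:.2f}")
print(f"Slope (β₁): {b1:.2f}")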
🧮 Cost Function – Mean Squared Error (MSE)
The cost function quantifies how well the line fits the data:

MSE = (1/n) * Σ(yᵢ - ŷᵢ)²

where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of observations. Lower MSE = better model fit.
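Continuing the toy numbers from above, the MSE can be computed directly from this formula; the predictions here are hypothetical:

import numpy as np

y_true = np.array([200000, 280000, 370000, 450000, 540000])
y_hat = np.array([198000, 285000, 362000, 455000, 536000])  # hypothetical predictions

# MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
mse = np.mean((y_true - y_hat) ** 2)
print(f"MSE: {mse:.2f}")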
🧗 Gradient Descent Optimisation
A technique to minimise the cost function by updating parameters iteratively:
theta = theta - alpha * gradient
where alpha is the learning rate.
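Here is a minimal, self-contained sketch of gradient descent for simple linear regression; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

# Toy data: roughly y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

theta0, theta1 = 0.0, 0.0   # intercept and slope
alpha = 0.01                # learning rate
n = len(X)

for _ in range(5000):
    error = (theta0 + theta1 * X) - y
    # Gradients of the MSE with respect to each parameter
    grad0 = (2 / n) * error.sum()
    grad1 = (2 / n) * (error * X).sum()
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(f"Learned intercept: {theta0:.3f}, slope: {theta1:.3f}")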
⚖️ Assumptions of Linear Regression
- Linearity – The relationship between X and Y is linear.
- Independence – Observations are independent of each other.
- Homoscedasticity – Errors have equal variance.
- Normality of errors – Residuals follow a normal distribution.
- No multicollinearity – Independent variables are not highly correlated with each other.
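A quick way to eyeball several of these assumptions is a residual plot: residuals scattered randomly around zero with roughly constant spread suggest linearity and homoscedasticity hold. A minimal, self-contained sketch on synthetic data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic data: a linear signal plus noise
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 200)

# Fit a line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# Residuals vs fitted values: look for random scatter around zero
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()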
🧑‍💻 Step-by-step Implementation in Python
📦 Prerequisites and Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
📂 Loading and Understanding the Dataset
We'll use a sample house-prices dataset loaded from a CSV file (a built-in Scikit-learn dataset would work just as well):
df = pd.read_csv('house_prices.csv')
print(df.head())
📊 Exploratory Data Analysis (EDA)
sns.scatterplot(data=df, x='Area', y='Price')
plt.title("Area vs Price")
plt.show()
🧠 Training the Linear Regression Model
X = df[['Area']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
📉 Visualising the Regression Line
y_pred = model.predict(X_test)

# Sort by Area so the line is drawn left to right instead of zigzagging
order = np.argsort(X_test['Area'].values)
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test['Area'].values[order], y_pred[order], color='red', label='Predicted')
plt.title('Regression Line')
plt.xlabel('Area')
plt.ylabel('Price')
plt.legend()
plt.show()
📊 Evaluating the Model
📐 R-squared (Coefficient of Determination)
r2_score(y_test, y_pred)
🔹 Indicates how much of the variation in Y is explained by X (typically between 0 and 1; closer to 1 is better).
🔢 Mean Absolute Error (MAE)
mean_absolute_error(y_test, y_pred)
🔹 Average of the absolute errors; less sensitive to outliers than MSE.
🔢 Mean Squared Error (MSE)
mean_squared_error(y_test, y_pred)
🔹 Average of the squared errors; penalises large mistakes more heavily.
🔢 Root Mean Squared Error (RMSE)
np.sqrt(mean_squared_error(y_test, y_pred))
🔹 Square root of MSE, expressed in the same units as the target.
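Putting the metrics together in one place (this reuses the model, X_test, y_test, and y_pred from the steps above):

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"R²:   {r2:.3f}")
print(f"MAE:  {mae:,.2f}")
print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")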
🎓 Expert Opinions and Best Practices
According to Dr. Sebastian Raschka (AI/ML Professor and Author):
"Linear regression is an interpretable baseline model that should always be the first attempt before trying complex algorithms."
Best practices:
- Always perform EDA before modelling.
- Check the assumptions using residual plots.
- Use regularisation (Ridge or Lasso) if overfitting is detected (see the sketch after this list).
- Prefer scaled inputs for better convergence.
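As a sketch of the regularisation point above: Ridge and Lasso are drop-in replacements for LinearRegression in Scikit-learn. This reuses the train/test split from earlier; the alpha values are arbitrary and should really be tuned, e.g. with cross-validation:

from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling matters for regularised models, so put a scaler in front
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

print("Ridge R²:", ridge.score(X_test, y_test))
print("Lasso R²:", lasso.score(X_test, y_test))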
⚠️ Effects and Pitfalls
| Issue | Effect on Model |
| --- | --- |
| Multicollinearity | Inflated variance, unstable coefficients |
| Outliers | Skewed results |
| Non-linearity | Inaccurate predictions |
| Violated assumptions | Misleading inferences |
🔧 Solutions:
Use transformations, feature selection, and robust regression techniques.
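For the outlier issue in particular, one option is Scikit-learn's HuberRegressor, which down-weights extreme points instead of letting them dominate the fit. A minimal sketch reusing the train/test split from above:

from sklearn.linear_model import HuberRegressor

# Huber loss is quadratic for small errors and linear for large ones,
# so a few extreme prices pull the line far less than with plain OLS
huber = HuberRegressor()
huber.fit(X_train, y_train)
print("Huber R²:", huber.score(X_test, y_test))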
💡 Final Thoughts and Suggestions
Linear regression may seem basic, but it holds its own for problems where interpretability and simplicity matter. It provides an excellent starting point for data exploration and benchmarking.
✅ Actionable Suggestions
- Use statsmodels for detailed statistical outputs (see the sketch after this list).
- For multiple regression, always check the VIF (Variance Inflation Factor).
- Use cross-validation for better generalisation.
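A sketch covering all three suggestions, reusing the data from the implementation above. With the single Area feature the VIF check is trivial, but the same code applies unchanged to multiple features:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import cross_val_score

# Detailed statistical output: coefficients, p-values, confidence intervals
X_sm = sm.add_constant(X_train)  # statsmodels needs an explicit intercept
ols = sm.OLS(y_train, X_sm).fit()
print(ols.summary())

# VIF per column: values above roughly 5-10 suggest problematic multicollinearity
vifs = [variance_inflation_factor(X_sm.values, i) for i in range(X_sm.shape[1])]
print(dict(zip(X_sm.columns, vifs)))

# Cross-validated R² for a better sense of generalisation
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("Mean CV R²:", scores.mean())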
📚 Additional Learning Resources
- Book: “Introduction to Statistical Learning” by James et al.
- Course: Andrew Ng’s Machine Learning Specialisation (Coursera)
- Tool: Google Colab – for cloud-based Python execution.
📝 Disclaimer
While I am not a certified machine learning engineer or data
scientist, I have thoroughly researched this topic using trusted academic
sources, official documentation, expert insights, and widely accepted industry
practices to compile this guide. This post is intended to support your learning
journey by offering helpful explanations and practical examples. However, for
high-stakes projects or professional deployment scenarios, consulting
experienced ML professionals or domain experts is strongly recommended.
Your suggestions and views on machine learning are welcome—please share them
below!
Previous Post 👉 Types of Machine Learning Algorithms – Overview of classification, regression, clustering, and dimensionality reduction
Next Post 👉 Logistic Regression for Classification Problems – Use cases, ROC curve, confusion matrix