Linear Regression: Theory, Code, Assumptions & Metrics

🧠 Introduction

Understanding linear regression is like learning the grammar of data science. It’s foundational and widely used across many domains, from finance and healthcare to marketing analytics and AI. In this tutorial, you’ll learn the theory, implementation, and evaluation of linear regression using Python, while addressing real-world concerns like assumptions, interpretation, and optimisation.

📈 What is Linear Regression?

Linear regression is a supervised machine learning algorithm used for predicting a continuous value. The model tries to draw a straight line through the data that best represents the relationship between input variables (X) and an output variable (Y).

🗣️ Example: Predicting house prices based on area, number of bedrooms, etc.

🌍 Real-world Applications of Linear Regression

  • Finance: Predicting stock prices or credit risk.

  • Healthcare: Estimating recovery time from treatment.

  • Marketing: Forecasting sales based on ad spend.

  • Education: Predicting student performance.

🧪 Theory Behind Linear Regression

📐 Line of Best Fit

We represent the regression line as:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:

  • $\beta_0$ is the intercept,

  • $\beta_1$ is the slope,

  • $\epsilon$ is the error term.
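For a single input variable, the intercept and slope that minimise squared error have a simple closed form. The sketch below is only an illustration on made-up data (the true values 2 and 3, the noise level, and all variable names are assumptions, not part of the original tutorial):

```python
import numpy as np

# Hypothetical toy data: y is roughly 2 + 3x plus small random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=x.size)

# Closed-form least-squares estimates for one input variable.
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta_0 = y_mean - beta_1 * x_mean                                          # intercept

# Residuals are the sample counterpart of the error term epsilon.
residuals = y - (beta_0 + beta_1 * x)

print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
print(f"mean residual = {residuals.mean():.2e}")
```

The estimates should land close to the values used to generate the data, and the residuals average out to roughly zero because the model includes an intercept.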

🧮 Cost Function – Mean Squared Error (MSE)

The cost function quantifies how well the line fits the data:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

A lower MSE means the line's predictions sit closer to the observed values.
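To make the formula concrete, here is a small worked example. The observed values and the two sets of predictions are invented purely for illustration:

```python
import numpy as np

# Hypothetical observed values and predictions from two candidate lines.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_a = np.array([2.5, 5.5, 6.5, 9.5])   # every prediction is off by 0.5
y_pred_b = np.array([1.0, 5.0, 7.0, 12.0])  # two larger misses (2 and 3)

def mse(y, y_hat):
    """Mean of the squared differences between observed and predicted values."""
    return np.mean((y - y_hat) ** 2)

print("MSE of line A:", mse(y_true, y_pred_a))  # (4 * 0.25) / 4 = 0.25
print("MSE of line B:", mse(y_true, y_pred_b))  # (4 + 0 + 0 + 9) / 4 = 3.25
```

Line B is exactly right on two points, yet its MSE is much higher, because squaring makes the two large misses dominate the average.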

🧗 Gradient Descent Optimisation

A technique to minimise the cost function by updating parameters iteratively:

theta = theta - alpha * gradient

Where alpha is the learning rate.
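The update rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a production optimiser: the data are synthetic, and the learning rate (0.05) and iteration count (2000) are values chosen to suit this toy problem only.

```python
import numpy as np

# Hypothetical toy data generated from y = 4 + 2.5x plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 4.0 + 2.5 * x + rng.normal(0.0, 0.3, size=x.size)

# Design matrix with a column of ones so theta[0] acts as the intercept.
X = np.column_stack([np.ones_like(x), x])
theta = np.zeros(2)
alpha = 0.05     # learning rate (assumed; needs tuning for other data)
n_iters = 2000   # number of update steps (assumed)

for _ in range(n_iters):
    errors = X @ theta - y                    # predictions minus targets
    gradient = (2 / len(y)) * (X.T @ errors)  # gradient of the MSE w.r.t. theta
    theta = theta - alpha * gradient          # the update rule from the text

# Direct least-squares solution for comparison.
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", theta)
print("least squares:   ", theta_exact)
```

If alpha is too large the updates overshoot and diverge; if it is too small, convergence is slow. On this data both approaches should agree to several decimal places.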

⚖️ Assumptions of Linear Regression

✅ These should hold, at least approximately, for the model's estimates and inferences to be reliable.

  1. Linearity – Relationship between X and Y is linear.

  2. Independence – Observations are independent of each other.

  3. Homoscedasticity – Equal variance of errors.

  4. Normality of errors – Residuals follow a normal distribution.

  5. No multicollinearity – Independent variables are not highly correlated with each other.
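Some of these assumptions can be given a rough numerical check from the residuals. The sketch below uses synthetic data and a handful of informal summary statistics; the thresholds people apply to them are rules of thumb, and formal tests or residual plots are usually a better guide. All names and data here are hypothetical.

```python
import numpy as np

# Hypothetical data with two mildly correlated features.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.3 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Fit ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Homoscedasticity: the size of the residuals should not trend with the fitted values.
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]

# Independence: Durbin-Watson statistic, close to 2 when residuals are uncorrelated.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Normality: sample skewness of the residuals, close to 0 for a symmetric distribution.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)

# Multicollinearity: pairwise correlation between the two features.
feature_corr = np.corrcoef(x1, x2)[0, 1]

print(f"corr(fitted, |residual|) = {spread_corr:.3f}")
print(f"Durbin-Watson            = {durbin_watson:.3f}")
print(f"residual skewness        = {skewness:.3f}")
print(f"corr(x1, x2)             = {feature_corr:.3f}")
```

Because this data was generated to satisfy the assumptions, the values should look unremarkable; on real data, large departures are a prompt to look at the residual plots more closely.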

🧑‍💻 Step-by-step Implementation in Python

📦 Prerequisites and Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

📂 Loading and Understanding the Dataset

We'll load the data from a CSV file. The code below assumes a file named house_prices.csv with Area and Price columns:

df = pd.read_csv('house_prices.csv')
print(df.head())
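If you don't have a house_prices.csv to hand, a synthetic stand-in with the same two column names lets you follow along. Everything in this sketch is an assumption made for illustration: the value ranges, the units, and the price formula are invented and do not reflect any real housing market.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for house_prices.csv (all numbers below are invented).
rng = np.random.default_rng(42)
area = rng.uniform(500, 3500, size=200)                        # assumed units: square feet
price = 50_000 + 120 * area + rng.normal(0, 20_000, size=200)  # assumed linear trend plus noise

df = pd.DataFrame({"Area": area.round(0), "Price": price.round(2)})

# First checks on any new dataset: shape, summary statistics, missing values.
print(df.head())
print(df.describe())
print("Missing values per column:")
print(df.isna().sum())
```

Running the same three checks on a real CSV is a quick way to spot missing values or implausible ranges before modelling.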

📊 Exploratory Data Analysis (EDA)

sns.scatterplot(data=df, x='Area', y='Price')
plt.title("Area vs Price")
plt.show()

🧠 Training the Linear Regression Model

X = df[['Area']]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
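Once the model is fitted, the intercept and coefficient can be read directly as the equation of the line. The sketch below shows one way to interpret them and to predict for a new input; it generates its own synthetic data (invented values, hypothetical 2000-unit house) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the housing CSV (values are invented).
rng = np.random.default_rng(0)
area = rng.uniform(500, 3500, size=200)
df = pd.DataFrame({
    "Area": area,
    "Price": 50_000 + 120 * area + rng.normal(0, 20_000, size=200),
})

model = LinearRegression().fit(df[["Area"]], df["Price"])

slope = model.coef_[0]        # expected change in Price per one-unit increase in Area
intercept = model.intercept_  # predicted Price at Area = 0 (often not meaningful by itself)

# Predict for a new, hypothetical house; keep the column name used in training.
new_house = pd.DataFrame({"Area": [2000]})
predicted = model.predict(new_house)[0]

# The prediction is just the line equation evaluated at the new input.
manual = intercept + slope * 2000

print(f"Price ≈ {intercept:.0f} + {slope:.2f} × Area")
print(f"Predicted price for Area=2000: {predicted:.0f}")
```

Note that model.coef_ is an array with one entry per feature, which is why the single slope is read with index 0.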

📉 Visualising the Regression Line

y_pred = model.predict(X_test)

plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red')
plt.title('Regression Line')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()

📊 Evaluating the Model

📐 R-squared (Coefficient of Determination)

print("R-squared:", r2_score(y_test, y_pred))

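🔹 Indicates the proportion of variation in Y explained by X. A value of 1 is a perfect fit and 0 is no better than always predicting the mean of Y; on test data it can be negative when the model does worse than that.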
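R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. Computing it by hand on a few invented numbers shows the three reference cases, including how the score can fall below zero:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]  # hypothetical observations, mean = 25

perfect = r_squared(y_true, [10.0, 20.0, 30.0, 40.0])     # exact predictions
mean_only = r_squared(y_true, [25.0, 25.0, 25.0, 25.0])   # always predict the mean
reversed_ = r_squared(y_true, [40.0, 30.0, 20.0, 10.0])   # trend in the wrong direction

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

The last case has SS_res = 2000 against SS_tot = 500, giving 1 - 4 = -3: predictions that are systematically worse than the mean produce a negative score.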

🔢 Mean Absolute Error (MAE)

print("MAE:", mean_absolute_error(y_test, y_pred))

🔹 Average of absolute errors.

🔢 Mean Squared Error (MSE)

print("MSE:", mean_squared_error(y_test, y_pred))

🔹 Average of squared errors; large errors are penalised more heavily.

🔢 Root Mean Squared Error (RMSE)

print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

🔹 Square root of the MSE, expressed in the same units as Y.
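The three error metrics can disagree about which model is better. In this small sketch (invented numbers), two sets of predictions have the same MAE, but the one with a single large miss has a much higher MSE and RMSE:

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred_even = np.array([110.0, 140.0, 210.0, 240.0, 310.0])     # every error is 10
y_pred_outlier = np.array([100.0, 150.0, 200.0, 250.0, 350.0])  # one error of 50

def report(name, y, y_hat):
    """Print and return MAE, MSE and RMSE for one set of predictions."""
    errors = y - y_hat
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    print(f"{name}: MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
    return mae, mse, rmse

even = report("even errors    ", y_true, y_pred_even)
outlier = report("one large error", y_true, y_pred_outlier)
```

Both cases have a MAE of 10, while the RMSE rises from 10 to about 22.36. Which metric to report depends on whether occasional large errors are especially costly in your application.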

🎓 Expert Opinions and Best Practices

According to Dr. Sebastian Raschka (AI/ML Professor and Author):

"Linear regression is an interpretable baseline model that should always be the first attempt before trying complex algorithms."

Best practices:

  • Always perform EDA before modelling.

  • Check for assumptions using residual plots.

  • Use regularisation (Ridge or Lasso) if overfitting is detected.

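  • Scale inputs when using regularisation or gradient-based solvers; scikit-learn's LinearRegression solves least squares directly, so scaling does not change its predictions.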
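As a rough illustration of the regularisation and scaling points, the sketch below fits ordinary least squares and Ridge on two nearly identical synthetic features. The data, the alpha value of 10, and the variable names are assumptions for this example; in practice alpha would be chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: x2 is almost a copy of x1, so the two are highly collinear.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

# Scaling inside the pipeline means the Ridge penalty treats both features equally.
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)

ols_coef = ols.named_steps["linearregression"].coef_
ridge_coef = ridge.named_steps["ridge"].coef_

print("OLS coefficients:  ", ols_coef)
print("Ridge coefficients:", ridge_coef)
```

With collinear inputs the least-squares coefficients can be large and of opposite sign, while Ridge tends to pull them toward smaller, more similar values. Lasso can be swapped in the same way when you also want some coefficients driven to exactly zero.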

⚠️ Effects and Pitfalls

| Issue | Effect on Model |
| --- | --- |
| Multicollinearity | Inflated variance, unstable coefficients |
| Outliers | Skewed results |
| Non-linearity | Inaccurate predictions |
| Violated assumptions | Misleading inferences |

🔧 Solutions:
Use transformations, feature selection, and robust regression techniques.
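As one example of a transformation, a target that grows exponentially with the input becomes linear after taking logs. The sketch below uses synthetic data (the growth rate of 0.5 and the noise level are invented) and compares a straight-line fit before and after the transform. R-squared values on different target scales are not strictly comparable, so treat the comparison as a rough indication only.

```python
import numpy as np

# Hypothetical data with exponential growth and multiplicative noise.
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 80)
y = np.exp(0.5 + 0.5 * x + rng.normal(0, 0.1, size=x.size))

def fit_line(x, target):
    """Fit target = intercept + slope * x and return slope, intercept and R^2."""
    slope, intercept = np.polyfit(x, target, 1)
    pred = intercept + slope * x
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return slope, intercept, 1 - ss_res / ss_tot

_, _, r2_raw = fit_line(x, y)                           # straight line on the raw target
slope_log, intercept_log, r2_log = fit_line(x, np.log(y))  # straight line on log(target)

print(f"R^2 on raw target: {r2_raw:.3f}")
print(f"R^2 on log target: {r2_log:.3f}")
print(f"slope on log scale: {slope_log:.3f}")
```

On the log scale the slope has a different interpretation: each one-unit increase in x multiplies the predicted y by roughly exp(slope).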

💡 Final Thoughts and Suggestions

Linear regression may seem basic, but it holds strong for problems where interpretability and simplicity matter. It provides an excellent starting point for data exploration and benchmarking.

✅ Actionable Suggestions

  • Use statsmodels for detailed statistical outputs.

  • For multiple regression, always check VIF (Variance Inflation Factor).

  • Use cross-validation for better generalisation.
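The VIF and cross-validation suggestions can be sketched together. VIF for a feature is 1 / (1 - R²), where R² comes from regressing that feature on the others; here it is computed by hand with scikit-learn rather than through a library helper. The three features (area, rooms, age) and all values are hypothetical, and the common "VIF above 5 or 10" warning level is a rule of thumb rather than a strict cutoff.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical housing features; rooms is deliberately tied to area.
rng = np.random.default_rng(5)
n = 200
area = rng.uniform(500, 3500, size=n)
rooms = area / 500 + rng.normal(0, 0.5, size=n)
age = rng.uniform(0, 50, size=n)  # unrelated to the other two
X = np.column_stack([area, rooms, age])
y = 50_000 + 100 * area + 5_000 * rooms - 300 * age + rng.normal(0, 15_000, size=n)

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["area", "rooms", "age"], vifs):
    print(f"VIF({name}) = {value:.2f}")

# 5-fold cross-validated R^2: each fold is scored on data the model did not see.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:", round(scores.mean(), 3))
```

Here area and rooms should show clearly elevated VIFs while age stays near 1. The spread of the per-fold scores gives a sense of how much a single train/test split could mislead.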

📚 Additional Learning Resources

  • Book: “An Introduction to Statistical Learning” by James et al.

  • Course: Andrew Ng’s Machine Learning Specialisation (Coursera)

  • Tool: Google Colab – for cloud-based Python execution.

📝 Disclaimer

While I am not a certified machine learning engineer or data scientist, I have thoroughly researched this topic using trusted academic sources, official documentation, expert insights, and widely accepted industry practices to compile this guide. This post is intended to support your learning journey by offering helpful explanations and practical examples. However, for high-stakes projects or professional deployment scenarios, consulting experienced ML professionals or domain experts is strongly recommended.
Your suggestions and views on machine learning are welcome—please share them below!

Author: Rajiv Dhiman · Publisher: Focus360Blog · Published: 2025-06-05