Spam Email Classifier using NLP and Naive Bayes Model

Spam Email Classifier using NLP and Naive Bayes with text vectorisation and email filtering 

Introduction

Email remains one of the most widely used communication tools, and with that reach comes spam: unsolicited messages that flood inboxes and can compromise both user experience and security. In this post, we'll walk through how to build a spam email classifier using NLP and a Naive Bayes model, with text vectorised by CountVectorizer or TfidfVectorizer in Python.

This is a beginner-to-intermediate project, suitable for students and working professionals who want to strengthen their grasp of natural language processing, machine learning and spam filtering.

Why Build a Spam Email Classifier?

Spam filters are critical for detecting malicious, irrelevant, or promotional emails. Effective spam detection systems help in:

  • Protecting users from phishing attacks

  • Reducing data clutter

  • Enhancing productivity

  • Improving deliverability for genuine messages

Expert View:

"Using a spam email classifier based on NLP and Naive Bayes is a proven and effective method. Its simplicity and efficiency make it highly suitable for real-time applications," says Dr. A. Mehta, AI Researcher, University of Cambridge.

Step-by-Step Guide to Building a Spam Classifier Using NLP

1. Import Libraries

You'll need some popular Python libraries to begin:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

2. Dataset Preparation

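We'll use the well-known SMS Spam Collection dataset. It contains SMS messages rather than emails, but the same steps apply to email text. The code below assumes the commonly distributed spam.csv version, in which column v1 holds the label (ham or spam) and column v2 holds the message text.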

df = pd.read_csv("spam.csv", encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
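Before moving on, it can help to confirm that the label mapping worked and to look at the class balance. The sketch below uses a few invented rows in place of spam.csv, and the column name label_id is an assumption made for illustration only.

```python
import pandas as pd

# Invented rows standing in for spam.csv; the real file has thousands of messages.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", "Spam"],  # note the stray capital letter
    "message": ["See you at 5", "WIN a prize now", "Lunch?", "On my way", "Free cash"],
})

# .map() returns NaN for any value missing from the dict, so count those rows.
toy["label_id"] = toy["label"].map({"ham": 0, "spam": 1})
unmapped = int(toy["label_id"].isna().sum())

# Drop rows whose label could not be mapped, then inspect the class balance.
clean = toy.dropna(subset=["label_id"]).astype({"label_id": int})
balance = clean["label_id"].value_counts().to_dict()

print("unmapped rows:", unmapped)
print("class counts:", balance)
```

If the unmapped count is not zero, the labels probably need normalising (for example with .str.lower()) before mapping.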

3. Text Preprocessing with NLP

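The cleaning step lowercases each message, strips punctuation and extra whitespace, removes common English stopwords, and reduces each remaining word to its stem with the Porter stemmer.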

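import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords', quiet=True)  # one-time download of the stopword list

# Build these once; recreating them for every word makes cleaning very slow.
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)  # non-word characters become single spaces
    text = text.replace('_', '')      # \W does not match underscores, so drop them here
    # Remove stopwords, then stem what remains.
    return ' '.join(stemmer.stem(word) for word in text.split() if word not in stop_words)

df['cleaned_message'] = df['message'].apply(clean_text)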
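To see what this kind of cleaning does to a single message without installing NLTK data, here is a simplified sketch. It is not the tutorial's clean_text: the stopword set is a tiny hypothetical one, the function name simple_clean is invented, and stemming is left out.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the tutorial's clean_text uses NLTK's full English list instead.
TINY_STOPWORDS = {"a", "the", "to", "you", "your", "is", "now"}

def simple_clean(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords (no stemming)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in TINY_STOPWORDS]
    return " ".join(tokens)

sample = "WINNER!! You have won a FREE prize. Call 0900-123 now!"
cleaned = simple_clean(sample)
print(cleaned)  # prints: winner have won free prize call 0900 123
```

With the Porter stemmer added, inflected forms of the same word would be mapped to a shared stem, which shrinks the vocabulary further.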

4. Vectorisation (Feature Extraction)

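Here, we convert text into numerical features. CountVectorizer produces raw word counts, while TfidfVectorizer also down-weights words that appear in many messages. The code below uses TfidfVectorizer and keeps the 3,000 most frequent terms.

vectorizer = TfidfVectorizer(max_features=3000)
# The result is a sparse matrix; MultinomialNB accepts it directly, so .toarray() is not needed.
X = vectorizer.fit_transform(df['cleaned_message'])
y = df['label'].values

Note that fitting the vectorizer on the full dataset lets vocabulary and IDF statistics from the test messages influence training. For a stricter evaluation, split the text first and fit the vectorizer on the training portion only.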

5. Splitting the Dataset

# stratify=y keeps the spam/ham ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
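One way to keep test messages out of the vectorizer's fit is to split the raw text first and vectorise afterwards. The sketch below shows that ordering on eight invented messages; the variable names text_train and text_test are assumptions, not part of the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented messages: four spam (1) and four ham (0).
texts = [
    "free prize now", "win cash prize", "claim free cash", "free entry win",
    "see you at lunch", "meeting moved to noon", "call me later", "running late sorry",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text, keeping the class ratio equal in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)  # vocabulary and IDF learned here only
X_test = vectorizer.transform(text_test)        # reuse them; no refitting on test data

print(X_train.shape, X_test.shape)
```

Words that occur only in the test messages are simply ignored by transform, which mirrors what happens when the deployed model meets unseen vocabulary.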

6. Naive Bayes Classification

Multinomial Naive Bayes is well suited to word-count features, and in practice it also works with fractional weights such as TF-IDF.

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
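To show roughly what the classifier computes, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace smoothing on word counts. It is a teaching aid under simplifying assumptions (four invented messages, whitespace tokenisation, hypothetical helper names), not a substitute for MultinomialNB.

```python
import math
from collections import Counter

# Invented training messages.
train = [
    ("free prize now", "spam"),
    ("win free cash", "spam"),
    ("see you at lunch", "ham"),
    ("call me at noon", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label, alpha=1.0):
    """log P(label) plus the sum of log P(word | label), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total + alpha * len(vocab)))
    return score

def classify(text):
    """Pick the class with the higher log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("free cash prize"))  # prints: spam
print(classify("lunch at noon"))    # prints: ham
```

The smoothing term alpha stops a word that never appeared in one class from driving that class's probability to zero.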

7. Evaluation Metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

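You can typically expect accuracy above 95% on this dataset. Keep in mind that the data is imbalanced (only about 13% of the messages are spam), so a model that labels everything as ham would already score around 87%. Precision and recall for the spam class are more informative than accuracy alone.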
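The arithmetic behind these metrics is short enough to work through by hand. The sketch below computes them from confusion-matrix counts; the two scenarios use hypothetical numbers chosen for illustration, not results from this dataset.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set with 870 ham and 130 spam messages.
# Scenario 1: a "filter" that labels every message as ham.
always_ham = metrics_from_counts(tp=0, fp=0, fn=130, tn=870)
# Scenario 2: a filter that catches 104 of the 130 spam, with 5 false alarms.
filter_a = metrics_from_counts(tp=104, fp=5, fn=26, tn=865)

print(always_ham)
print(filter_a)
```

The first scenario reaches 87% accuracy while catching no spam at all, which is why recall and precision matter here. For binary 0/1 labels, the four counts can be read from scikit-learn with tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel().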

Minor Enhancements

Save and Use the Model Later

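import pickle

# Context managers make sure each file is flushed and closed after saving.
with open('spam_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)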
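Loading works the same way in reverse. The sketch below does a full save-and-load round trip on a toy model; storing both objects in one dictionary and the file name spam_bundle.pkl are assumptions made for illustration, not something the steps above require.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: two spam (1) and two ham (0) messages.
texts = ["free prize now", "win free cash", "see you at lunch", "call me at noon"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

path = os.path.join(tempfile.mkdtemp(), "spam_bundle.pkl")  # hypothetical file name

# Save both objects together so the model and its vocabulary cannot get out of sync.
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "model": model}, f)

# Later, possibly in another process: load the bundle and predict.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_text = ["free cash prize"]
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(new_text))
original_pred = model.predict(vectorizer.transform(new_text))
print(restored_pred, original_pred)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code. It is also safest to load with the same scikit-learn version that was used for saving.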

Predicting New Emails

def predict_spam(text):
    cleaned = clean_text(text)
    vector = vectorizer.transform([cleaned]).toarray()
    return "Spam" if model.predict(vector)[0] else "Ham"
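A hard Spam/Ham answer hides how confident the model is. The sketch below, which assumes a scikit-learn Pipeline trained on a few invented messages and a hypothetical helper called predict_with_threshold, returns the spam probability as well and lets you raise the threshold when false alarms are costly. For brevity it skips the clean_text step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: three spam (1) and three ham (0) messages.
texts = ["free prize now", "win free cash", "claim your free prize",
         "see you at lunch", "call me at noon", "meeting moved to friday"]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorises raw text and then classifies it in one call.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(texts, labels)

def predict_with_threshold(text, threshold=0.5):
    """Return (label, P(spam)); flag as spam only at or above the threshold."""
    # Column 1 of predict_proba corresponds to class 1 (spam) for 0/1 labels.
    p_spam = float(spam_pipeline.predict_proba([text])[0][1])
    return ("Spam" if p_spam >= threshold else "Ham"), p_spam

print(predict_with_threshold("free prize inside"))
print(predict_with_threshold("free prize inside", threshold=0.99))
```

Raising the threshold trades missed spam for fewer legitimate messages wrongly flagged; the right value depends on which mistake is more costly for your users.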

Important Notes

  • TfidfVectorizer weights words by how distinctive they are across messages, which often works better than raw counts; compare both on your own data.

  • Preprocessing is key, so never skip cleaning.

  • Always evaluate using multiple metrics, not just accuracy.

Real-World Applications

  • Gmail’s spam filtering mechanism

  • Corporate firewalls for phishing detection

  • E-commerce websites for email campaign filtering

  • IoT email alerts for filtering false warnings

Final Thoughts on Building Spam Email Classifier using NLP and Naive Bayes

The project outlined above is practical and a solid foundation for NLP-based applications. By implementing a spam email classifier with NLP and a Naive Bayes model, you gain experience in text preprocessing, vectorisation, and machine learning workflows, all of which are essential skills for modern AI developers.

For long-term improvement, you may explore:

  • Word embeddings (like Word2Vec)

  • Ensemble models

  • Deep Learning techniques (LSTM or Transformer-based)

Expert Insight:

"A good spam classifier must evolve. Adding feedback loops and retraining with fresh spam data keeps performance high," suggests Dr. Kavita Rao, NLP Engineer at Infosys.



Disclaimer:
While I am not a certified machine learning engineer or data scientist, I have thoroughly researched this topic using trusted academic sources, official documentation, expert insights, and widely accepted industry practices to compile this guide. This post is intended to support your learning journey by offering helpful explanations and practical examples. However, for high-stakes projects or professional deployment scenarios, consulting experienced ML professionals or domain experts is strongly recommended.

Your suggestions and views on machine learning are welcome—please share them below!

 

Author: Rajiv Dhiman. Published by Focus360Blog on 2025-06-26.