Introduction

Machine Learning

Machine learning is the study of computer algorithms that improve automatically through experience. It lets systems access data and use it to learn for themselves.

Probability

Learning algorithms are designed using probability, and they use probability to make their decisions.

Statistics

Statistics is the mathematical study of data. It summarises observations into understandable information and finds patterns in data that can be used for prediction.

Literature Reviews

Comparative Study on Email Spam Classifier using Data Mining Techniques

Phishing Email Detection Based on Structural Properties

  • Proposes a way to identify spam email using distinct structural properties
  • Classifies potential phishing emails with a one-class Support Vector Machine (a classification algorithm)
  • Overall, demonstrates an effective approach to preventing wide exposure to suspicious emails with minimal effort

Council Post: The Dangers of Phishing

  • Presents the rising threat of phishing, supported by statistics
  • Describes the problem of spam emails and social engineering, and the need for spam classification filters
  • Provides a brief overview of how machine learning is used to automate spam email detection, including classification algorithms such as kNN

Email Spam Classification Using Hybrid Approach of RBF Neural Network and Particle Swarm Optimization

Tying into the Computer Science Aspect


Naive Bayes Theorem

The Email Spam Classifier detects spam with a Naive Bayes classifier, which is built on Bayes' Theorem.
The theorem is used to calculate the probability of each individual word given that the email is ham or spam.
These word probabilities are then combined with the overall probabilities of the ham and spam classes by plugging them into Bayes' Theorem.
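
To make the calculation concrete, here is a minimal from-scratch sketch in Python. It scores each class as P(class) multiplied by the product of P(word | class) over the words in the email, and picks the larger score. The toy messages, the function names, and the use of add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from the project.

```python
import math
from collections import Counter

# Toy training data, made up for illustration (not the project's dataset).
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch on monday", "ham"),
]

def fit(examples):
    """Count words per class and documents per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def log_scores(text, word_counts, class_counts, vocab):
    """Return log P(class) + sum of log P(word | class) for each class."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Prior: fraction of training emails that belong to this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Assumption: add-one smoothing so unseen word/class pairs
            # do not produce a zero probability.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

def classify(text, model):
    """Pick the class with the highest score."""
    scores = log_scores(text, *model)
    return max(scores, key=scores.get)

model = fit(train)
print(classify("free money", model))
print(classify("monday meeting", model))
```

Logarithms are used so that multiplying many small probabilities becomes a sum, which avoids numerical underflow on longer emails.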


Reasons to use Naive Bayes Theorem

Naive Bayes is fast and robust in comparison to other classification algorithms.
Since it is widely used for text classification problems, there are many resources and learning materials available.
A popular Machine Learning toolkit that implements Naive Bayes is Scikit-learn.



Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction
between computers and humans using natural language.

The ultimate objective of NLP is to read, decipher, understand, and make sense of
human languages in a manner that is valuable.
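
As a small illustration of the kind of text preparation NLP involves for a spam classifier, the sketch below lowercases a message, strips punctuation, splits it into tokens, and drops stopwords. The function name `clean_text` and the tiny stopword set are hypothetical and chosen only for this example; a fuller list such as the one shipped with NLTK would normally be used.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "to", "for", "of", "and", "you", "your"}

def clean_text(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in STOPWORDS]

print(clean_text("Claim your FREE prize now!"))
```

The resulting list of key words is what a later step would turn into token counts.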

Scikit-learn

Scikit-learn is an open-source Python Machine Learning library that can be used to train classifiers.



# Import libraries (pandas, sklearn, numpy, nltk)
# Load the data
# Dataset: https://www.kaggle.com/balakishan77/spam-or-ham-email-classification
# Exploratory Data Analysis and Preprocessing
# Printing first few rows of data, remove duplicates, view missing data
# Make a function that cleans the text by removing stopwords (words that carry little meaning) and non-words such as punctuation or special characters
# Inside the function, tokenize the data (split sentences into lists of key words)
# Show the tokenized output
# Encode text into token counts for machine learning using CountVectorizer
# Split the training datasets and testing datasets (Keep it as 80% training and 20% testing)
# Create and train the Naive Bayes Classifier
# Print out predicted and actual values for the spam/ham classification
# Check the accuracy (%) of the model on the training data
# Evaluate the model on the testing data set
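
The outline above could be sketched with scikit-learn roughly as follows. This is only an illustration under stated assumptions: a handful of made-up messages stands in for the Kaggle dataset, scikit-learn's built-in English stopword list replaces a custom NLTK cleaning function, and `random_state=0` is an arbitrary choice to make the split repeatable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the real dataset; the first message appears twice.
texts = [
    "Win a free prize now",
    "Claim your free money today",
    "Cheap loans available apply now",
    "You won a lottery claim prize",
    "Free offer click the link now",
    "Win a free prize now",
    "Meeting moved to Monday morning",
    "Please review the attached report",
    "Lunch tomorrow with the project team",
    "Can you send the lecture notes",
    "Project deadline is next Friday",
]
labels = ["spam"] * 6 + ["ham"] * 5

# Remove duplicate (text, label) pairs while keeping their original order.
pairs = list(dict.fromkeys(zip(texts, labels)))
texts = [text for text, _ in pairs]
labels = [label for _, label in pairs]

# Encode text into token counts; stop_words="english" drops common words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Split into 80% training and 20% testing data, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

# Create and train the Naive Bayes classifier.
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Compare predicted and actual values on the test set.
test_pred = clf.predict(X_test)
for predicted, actual in zip(test_pred, y_test):
    print(f"predicted={predicted} actual={actual}")

# Accuracy on the training data and on the held-out testing data.
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, test_pred)
print(f"training accuracy: {train_acc:.0%}")
print(f"testing accuracy: {test_acc:.0%}")
```

With so few messages the accuracy figures are not meaningful; the sketch only shows how the steps fit together. Following the outline, the vectorizer is fitted before the split, so its vocabulary also includes words from the test messages.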