Machine learning is the study of computer algorithms that improve automatically through experience. It allows programs to access data and use it to learn for themselves.
Many learning algorithms are designed around probability: they make decisions by estimating how likely each possible outcome is.
Statistics is the mathematical study of data. It summarises observations into understandable information and finds patterns in data that can be used for prediction.
The Email Spam Classifier detects spam email with a Naive Bayes classifier, which is built on Bayes' theorem.
The classifier estimates, from the email documents, the probability of each individual word given that an email is ham (legitimate) or spam.
These word probabilities are then combined with the prior probabilities of the ham and spam classes in Bayes' theorem to estimate how likely an email is to be spam.
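As a rough illustration of how these probabilities combine, the sketch below applies Bayes' theorem with the naive assumption that words are independent given the class: P(spam | words) equals P(spam) times the product of P(word | spam), divided by that same quantity summed over the spam and ham classes. All numbers and names (word_given_spam, word_given_ham, prior_spam, prior_ham, posterior_spam) are made up for the example and are not taken from the dataset.

```python
import math

# Hypothetical per-word likelihoods, as if estimated from labelled emails.
word_given_spam = {"free": 0.30, "meeting": 0.02, "prize": 0.20}
word_given_ham = {"free": 0.05, "meeting": 0.25, "prize": 0.01}

# Hypothetical class priors: the share of spam and ham emails in the data.
prior_spam = 0.4
prior_ham = 0.6


def posterior_spam(words):
    """Return P(spam | words) under the naive independence assumption."""
    # Sum logs instead of multiplying so many small probabilities do not underflow.
    log_spam = math.log(prior_spam) + sum(math.log(word_given_spam[w]) for w in words)
    log_ham = math.log(prior_ham) + sum(math.log(word_given_ham[w]) for w in words)
    # Normalise the two joint scores so that they sum to one.
    spam_score = math.exp(log_spam)
    ham_score = math.exp(log_ham)
    return spam_score / (spam_score + ham_score)


print(round(posterior_spam(["free", "prize"]), 3))
print(round(posterior_spam(["meeting"]), 3))
```

With these made-up numbers, an email containing "free" and "prize" gets a spam probability of about 0.988, while one containing only "meeting" gets about 0.051.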
Naive Bayes is fast and robust compared with many other classification algorithms.
Because it is widely used for text classification problems, many resources and learning materials are available.
A popular machine learning toolkit that implements Naive Bayes classifiers is Scikit-learn.
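A minimal sketch of that toolkit in use, assuming scikit-learn and NumPy are installed. MultinomialNB is scikit-learn's Naive Bayes variant for count features; the word-count matrix and its three-word vocabulary below are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: one row per email, one column per word.
# Columns (hypothetical vocabulary): "free", "prize", "meeting"
X = np.array([
    [3, 2, 0],  # spam
    [2, 1, 0],  # spam
    [0, 0, 3],  # ham
    [0, 1, 2],  # ham
])
y = np.array(["spam", "spam", "ham", "ham"])

# alpha=1.0 applies Laplace smoothing, so a word never seen in one class
# does not force that class's probability to zero.
model = MultinomialNB(alpha=1.0)
model.fit(X, y)

# A new email that mentions "free" once and "prize" once.
new_email = np.array([[1, 1, 0]])
print(model.predict(new_email))
print(model.classes_)
print(model.predict_proba(new_email).round(3))
```

The columns of predict_proba follow the order of model.classes_, which scikit-learn sorts alphabetically.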
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction
between computers and humans using natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of
human languages in a manner that is valuable.
Scikit-learn is an open-source Python machine learning library, used here to train the classifier.
# Import libraries (pandas, sklearn, numpy, nltk)
# Load the data
# Dataset: https://www.kaggle.com/balakishan77/spam-or-ham-email-classification
# Exploratory Data Analysis and Preprocessing
# Print the first few rows of the data, remove duplicates, and check for missing data
# Write a function that cleans the text by removing stopwords (words that carry little meaning) and non-words such as punctuation or special characters
# Inside the function, tokenize the data (split sentences into lists of key words)
# Show the result of the tokenization
# Encode text into token counts for machine learning using CountVectorizer
# Split the data into training and testing sets (80% training, 20% testing)
# Create and train the Naive Bayes Classifier
# Print out predicted and actual values for the spam/ham classification
# Check the accuracy (%) of the model on the training data
# Evaluate the model on the testing data set
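
The steps outlined above could be sketched as follows. This is one possible implementation under stated assumptions, not a reference solution: a small in-memory DataFrame stands in for the Kaggle CSV (in practice the file would be loaded with pd.read_csv), the column names text and label are hypothetical, scikit-learn's built-in ENGLISH_STOP_WORDS replaces the NLTK stopword list so that nothing needs to be downloaded, and stratify=y is an addition that keeps both classes balanced in such a small split.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the data: an invented stand-in for the real dataset.
spam = [
    "Win a free prize now! Click here to claim your cash reward",
    "Congratulations, you won a free lottery ticket. Claim your prize today",
    "Cheap loans available, click now for free cash",
    "Claim your free cash reward, limited offer!!!",
    "You won! Click to claim the lottery prize",
    "Limited offer: cheap pills, free shipping, click now",
    "Free cash prize waiting, claim your reward today",
    "Exclusive offer, win cheap lottery tickets now",
    "Click here for a free prize and cash reward",
    "Urgent: claim your free lottery winnings now",
]
ham = [
    "Can we move the project meeting to Monday morning?",
    "Please review the attached report before the meeting",
    "Lunch on Friday? Let me know what time works",
    "The project schedule and report are attached",
    "Thanks for the notes from the Monday meeting",
    "Are you joining the team lunch on Friday?",
    "I have attached the schedule for the team review",
    "Let me know if the report needs another review",
    "Meeting notes and project schedule for the team",
    "See you Monday, thanks for the lunch",
]
# The first spam email is repeated on purpose to exercise duplicate removal.
data = pd.DataFrame({
    "text": spam + ham + [spam[0]],
    "label": ["spam"] * len(spam) + ["ham"] * len(ham) + ["spam"],
})

# Exploratory data analysis and preprocessing.
print(data.head())
data = data.drop_duplicates().reset_index(drop=True)
print(data.isnull().sum())


def clean_text(text):
    """Remove punctuation, lowercase, split into tokens, and drop stopwords."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.lower().split() if w not in ENGLISH_STOP_WORDS]


# Show the tokenization for the first few emails.
print(data["text"].head(3).apply(clean_text))

# Encode the text as token counts, using clean_text as the analyzer.
vectorizer = CountVectorizer(analyzer=clean_text)
X = vectorizer.fit_transform(data["text"])
y = data["label"]

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Create and train the Naive Bayes classifier.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predicted versus actual labels, and accuracy on the training data.
train_pred = classifier.predict(X_train)
print("Predicted:", train_pred[:5])
print("Actual:   ", y_train.values[:5])
train_accuracy = accuracy_score(y_train, train_pred)
print(f"Training accuracy: {train_accuracy:.1%}")

# Evaluate the model on the testing data.
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(f"Testing accuracy: {test_accuracy:.1%}")
```

On a dataset this small the accuracy figures say little about real-world performance; they only show that the pipeline runs end to end.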