Spam Classification using Deep Neural Network Architecture
Spam classification is a classic problem in machine learning. Although it sounds easy, classifying messages as spam or legitimate (ham) poses a real challenge for machine learning beginners. Some of the challenges we faced include:
- Length of messages and emails: An email averages 100–150 words but can run to thousands of words (one training instance had 8300 words!). A text message, on the other hand, is short and can be as small as 2 words.
- Frequency of emails: Legitimate emails that arrive very frequently can be misclassified as spam. The same is true of text messages.
- Abbreviations in the messages: Emails are mostly written in formal language, whereas text messages are written in informal language, which makes spam harder to detect.
Experiments were carried out on two publicly available datasets: the SMS Spam Collection Data Set (SMS) and the Enron Email Dataset.
The SMS Spam Collection Data Set is hosted at the UCI Machine Learning Repository. It is a publicly available set of labelled SMS messages collected for research on mobile phone spam.
It consists of 5572 text messages, of which 4825 are ham and 747 are spam. The messages come from several sources: the Grumbletext website (425 spam SMS), the National University of Singapore SMS Corpus (3375 ham SMS), Caroline Tag’s PhD thesis (450 ham SMS) and SMS Corpus v.0.1 Big (1002 ham SMS and 322 spam SMS).
The Enron Email dataset has around 30000 emails, of which a subset of 17000 messages was used for experimentation. The emails are labelled as spam or ham.
The data is divided into 5 folds, each corresponding to an employee of the Enron organisation, and each fold contains both ham and spam messages. For example, the fold owned by "kaminski-v" has 5857 emails in total, of which 4361 are legitimate and 1496 are spam.
Before proceeding with the classification task, we visualized both datasets. We started with the distribution of ham and spam messages in each dataset, plotted as a pie chart for easier visualization, as shown below:
Then, we made a histogram comparing the number of words in ham and spam messages.
We observe that spam messages tend to contain more words than ham messages in both datasets.
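The word-count comparison behind such a histogram can be sketched in a few lines. This is a minimal illustration, assuming the messages are available as (label, text) pairs; the sample messages and the helper name `mean_word_count` are made up for the example and do not come from either dataset.

```python
from collections import defaultdict

# Toy (label, text) pairs; illustrative only, not taken from the datasets.
messages = [
    ("ham", "see you at lunch"),
    ("ham", "ok"),
    ("spam", "win a free prize now call this number to claim"),
    ("spam", "urgent your account has won a cash reward reply yes"),
]

def mean_word_count(pairs):
    """Return the average number of whitespace-separated words per label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for label, text in pairs:
        totals[label] += len(text.split())
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

averages = mean_word_count(messages)
print(averages)
```

On real data, the per-message word counts (rather than only their averages) would be passed to a plotting library to draw the histogram.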
Next, we looked at the top words in the spam and ham categories along with their frequencies, and visualized them as a word cloud, as shown below:
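The frequency counts that feed a word cloud could be computed along these lines. This is only a sketch under simple assumptions: tokens are split on whitespace and lowercased, no stop words are removed, and the example texts and the helper `top_words` are hypothetical.

```python
from collections import Counter

# Toy messages; illustrative only.
spam_texts = ["free prize call now", "call now to claim free cash"]
ham_texts = ["see you at lunch", "call me when you are free"]

def top_words(texts, n=3):
    """Return the n most frequent lowercase tokens with their counts."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter.most_common(n)

spam_top = top_words(spam_texts)
ham_top = top_words(ham_texts)
print(spam_top)
print(ham_top)
```

In practice, stop words such as "to" or "you" would probably be filtered out first, since they dominate both categories.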
We first tried several baseline models: SVM, Logistic Regression, Decision Trees, Random Forests and KNN. We obtained the following accuracies:
We observed that SVM performed best at classifying spam messages in the SMS dataset, as shown in Table 1.
Similarly, for the email dataset, SVM performed best, closely followed by Random Forest and Logistic Regression.
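For readers who want to reproduce a baseline of this kind, here is a minimal scikit-learn sketch that pairs TF-IDF features with two of the classifiers. It is not our exact setup: the toy texts and labels are invented, `LinearSVC` is assumed as the SVM variant, and all hyperparameters are library defaults apart from `max_iter`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data; illustrative only. 1 = spam, 0 = ham.
texts = [
    "free prize call now",
    "win cash claim your reward",
    "urgent free cash offer",
    "see you at lunch",
    "are we meeting tomorrow",
    "thanks for the notes",
]
labels = [1, 1, 1, 0, 0, 0]

# Assumed classifier choices; the kernel and settings are not from the post.
baselines = {
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

fitted = {}
for name, clf in baselines.items():
    # Each pipeline vectorizes raw text with TF-IDF, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(texts, labels)
    fitted[name] = pipeline

# Predict the label of one unseen message with every fitted baseline.
predictions = {
    name: int(p.predict(["free cash prize"])[0]) for name, p in fitted.items()
}
print(predictions)
```

On the real datasets, the data would be split into train and test sets and accuracy measured on the held-out part.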
Now, we propose a different architecture. In the baseline models, we used CountVectorizer() and TF-IDF for vectorization. For our proposed system, we used GloVe and took the feature matrix obtained from the embedding layer as the vector representation.
In case you are not familiar with it, GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
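Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. The sketch below shows one common way to turn such a file into an embedding matrix indexed by token id. It is an illustration under assumptions: a small in-memory string stands in for a real GloVe file, the 3-dimensional values are made up, and `word_index` is a hypothetical tokenizer vocabulary.

```python
import io
import numpy as np

# Stand-in for a GloVe text file: each line is a word followed by its vector.
# Real files have the same layout with more dimensions; these values are made up.
fake_glove_file = io.StringIO(
    "free 0.1 0.2 0.3\n"
    "call 0.4 0.5 0.6\n"
    "lunch 0.7 0.8 0.9\n"
)

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")  # row 0 = padding
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

word_index = {"free": 1, "call": 2, "prize": 3}  # hypothetical tokenizer output
vectors = load_vectors(fake_glove_file)
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```

A matrix like this can be used as the initial weights of an embedding layer; words missing from GloVe (here "prize") keep an all-zero row.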
The architecture has 2 phases:
- Feature Extraction: In this phase, an LSTM-based model is used to extract deep features. The model is first trained on the training data; after training, the output of the embedding layer is extracted as a feature vector for both the train and test data.
- Modelling: In this phase, the extracted deep features are fed as input to machine learning models such as Support Vector Machines (SVM), K-Nearest Neighbours (KNN), Decision Trees (DT), Logistic Regression (LR) and Random Forests (RFC) to predict whether a message is spam or ham.
Following is the sample code for the implementation:
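    # Assumed imports for the Keras API bundled with TensorFlow
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    # vocab_size and maxlen come from the tokenized training data
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=maxlen))
    model.add(LSTM(units=32))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])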
As we can see, the model uses only 1 Embedding layer (from which we extract the feature matrix), 1 LSTM layer and 1 Dense layer.
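To make the feature extraction step more concrete: the output of an embedding layer is a row lookup into its weight matrix, so the features can be illustrated with plain NumPy. The sketch below rests on several assumptions that are not part of our code: a random matrix stands in for the trained weights, `maxlen` and `vocab_size` are toy values, the helpers are hypothetical, and flattening the per-token vectors into one long vector is just one possible way to obtain a fixed-size input for the classical models.

```python
import numpy as np

maxlen = 5          # hypothetical maximum sequence length
embedding_dim = 32  # matches output_dim=32 in the model above
vocab_size = 10     # hypothetical vocabulary size

# Stand-in for the trained embedding weights. With a trained Keras model these
# could be read with model.layers[0].get_weights()[0].
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(vocab_size, embedding_dim)).astype("float32")

def pad_or_truncate(sequence, length):
    """Left-pad with zeros, or keep only the last `length` token ids."""
    trimmed = sequence[-length:]
    return [0] * (length - len(trimmed)) + trimmed

def embedding_features(sequences, weights, length):
    """Look up each token id in the weight matrix and flatten per message."""
    padded = np.array([pad_or_truncate(s, length) for s in sequences])
    embedded = weights[padded]  # shape: (n_messages, length, embedding_dim)
    return embedded.reshape(len(sequences), -1)

token_ids = [[3, 7], [1, 2, 3, 4, 5, 6, 8]]  # one short and one long message
features = embedding_features(token_ids, embedding_weights, maxlen)
print(features.shape)  # (2, 160)
```

The resulting 2-D array has one row per message and can be passed to a classifier such as SVM or Logistic Regression. Padding and truncating at the front mirrors the default behaviour of the Keras `pad_sequences` utility, which is how very short SMS messages and very long emails end up with the same length.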
We performed experiments using our proposed feature extraction module with different machine learning models on both datasets; the results are shown in Tables 3 and 4. We observe that the SVM and Logistic Regression models work well with the deep features extracted using the LSTM on both datasets.
We also ran a baseline study on both datasets using a few machine learning models with TF-IDF features; the results are shown in Tables 1 and 2.
On the SMS dataset, Logistic Regression with LSTM features shows an increase in accuracy of around 4%, and SVM with LSTM features an increase of about 1%, compared with their respective baseline models.
On the Enron Email dataset, SVM with LSTM features shows an improvement of 1%, and Logistic Regression with LSTM features also improves on its baseline.
Email is an easy and preferred means of communication among business organizations, even in the long term, while SMS refers to the message service used on mobile devices.
As technology and internet facilities have grown, the volume of spam emails and messages has surged. The proposed method for filtering spam messages and emails therefore extracts deep features from an LSTM-based model and feeds them as input to the baseline machine learning models to predict whether a message or email is ham or spam.
Accuracy increased when Logistic Regression and Support Vector Machines were given the LSTM-based features as input, compared with their respective baseline models, on both the Email and SMS datasets. Both models performed well when complemented with the extracted deep features.
We would like to thank our Professor Dr Tanmoy Chakraborty and our Teaching Fellow and Teaching Assistants for their support and guidance throughout the project.
Teaching Fellow: Ms. Ishita Bajaj
Teaching Assistants: Chhavi Jain, Pragya Srivastava, Shiv Kumar Gehlot, Vivek Reddy, Shikha Singh, and Nirav Diwan.
This project was a combined effort. We collaborated as a team and coordinated throughout the project, and everyone participated in all parts of it.
Below are the individual contributions:
Pruthivi Raj Behera: Literature review, Data pre-processing and analysis, Data Visualization, Implementation of the baseline models and proposed model on the Email dataset and Blog.
Roshan S: Literature review, Report writing, Data pre-processing, analysis and Data Visualization of the SMS dataset, Implementation of the proposed model on the SMS dataset, Idea and proposal.
Shreya Goel: Literature review, Report writing, Presentation, Data pre-processing and analysis, Data visualization and Implementation of baseline models on the SMS dataset, Analysis of results obtained from proposed models.