Spam Classification using Deep Neural Network Architecture

By - Pruthivi Raj Behera, Shreya Goel and Roshan S

Email is a hot target for spam.

Spam classification is a classic problem in the Machine Learning domain. Although it sounds easy, classifying messages as spam or legitimate poses a real challenge for Machine Learning beginners. Some of the challenges we faced include:

  • Length of messages and emails: An email averages 100–150 words but can run into thousands of words (in one training instance, the email was 8,300 words long!). A text message, on the other hand, is relatively short and can be as brief as 2 words. One common way to handle this mismatch is to pad or truncate every message to a fixed length, as sketched below.
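Below is a minimal sketch of such padding using the Keras Tokenizer and pad_sequences; the example texts and the 200-token cap are illustrative assumptions on our part, not values from the original pipeline.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Toy messages standing in for the real corpus
texts = ["Congratulations! You have won a free prize, call now",
         "Meeting moved to 3 pm, see you there"]
maxlen = 200  # assumed cap on tokens per message

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Short messages are zero-padded, very long emails are truncated to maxlen
X = pad_sequences(sequences, maxlen=maxlen)
vocab_size = len(tokenizer.word_index) + 1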

Dataset

Experiments are carried out on two publicly available datasets, namely the SMS Spam Collection Data Set (SMS) and the Enron Email Dataset.

SMS Dataset:

The SMS Spam Collection Data Set is hosted at the UCI Machine Learning Repository. It is a publicly available dataset of labelled SMS messages that were collected for research on mobile phone spam.

It consists of 5,572 text messages, of which 4,825 are ham and 747 are spam. The dataset is a collection of messages from various sources, including the Grumbletext website (425 spam SMS), the National University of Singapore SMS Corpus (3,375 ham SMS), Caroline Tagg's PhD thesis (450 ham SMS) and the SMS Spam Corpus v.0.1 Big (1,002 ham SMS and 322 spam SMS).

E-Mail Dataset:

The Enron Email dataset has around 30,000 emails, of which a subset of 17,000 messages has been taken for experimentation. These emails fall into spam and ham categories.

The data is divided into 5 folds, and each fold corresponds to an employee of the Enron organisation. Each fold has both ham and spam messages. For example, one of the folds belongs to the owner "kaminski-v" and has a total of 5,857 emails, of which 4,361 are legitimate and 1,496 are spam.

Data Visualization

Before proceeding with the classification task, we visualized both datasets. We started by finding the distribution of ham and spam words in each dataset. We plotted a pie chart for easier visualization, as shown below:

Distribution of Ham and Spam words in the Datasets
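As a rough sketch, a pie chart like this can be produced with matplotlib; the counts below are the SMS message counts quoted earlier, used purely for illustration.

import matplotlib.pyplot as plt

# SMS dataset counts quoted above, used for illustration
labels = ['Ham', 'Spam']
counts = [4825, 747]

plt.pie(counts, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Ham vs. spam in the SMS dataset')
plt.show()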

Then, we made a histogram comparing the number of words in ham and spam messages.

We observe that spam messages tend to contain more words than ham messages in both datasets.
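A minimal sketch of such a histogram, assuming the data sits in a pandas DataFrame with 'label' and 'text' columns (the column names and toy rows below are our own, not the actual corpus):

import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    'label': ['ham', 'ham', 'spam', 'spam'],
    'text': ['See you at lunch', 'Call me when you are free',
             'WIN a brand new phone, reply now to claim',
             'Free entry! Text WIN to claim your prize today'],
})
df['num_words'] = df['text'].str.split().str.len()

# Overlay the word-count distributions of ham and spam messages
df.loc[df['label'] == 'ham', 'num_words'].hist(bins=10, alpha=0.6, label='ham')
df.loc[df['label'] == 'spam', 'num_words'].hist(bins=10, alpha=0.6, label='spam')
plt.xlabel('Words per message')
plt.ylabel('Number of messages')
plt.legend()
plt.show()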

Then, we looked into the top words in both the spam and ham categories along with their frequencies. We made a word cloud, as shown below:

The larger the word, the more frequent it is.
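A word cloud like this can be generated with the wordcloud package; the sample spam text below is a made-up stand-in for the real concatenated spam messages.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Made-up stand-in for the concatenated spam messages in the corpus
spam_text = 'free win winner cash prize urgent claim offer call now free txt reply'

wc = WordCloud(width=800, height=400, background_color='white').generate(spam_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()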

Baseline Models

We first tried different baseline models, including SVM, Logistic Regression, Decision Trees, Random Forests and KNN, and got the accuracies shown below:

We observed that SVM performed the best for spam classification on the SMS dataset, as shown in Table 1.

Similarly, for the email dataset, we observed that SVM performed the best, closely followed by Random Forest and Logistic Regression.
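For reference, a baseline of this kind can be built in a few lines with scikit-learn; the toy messages, the TF-IDF settings and the choice of LinearSVC as the SVM variant are our own assumptions, not the exact configuration used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy data standing in for the pre-processed messages; 1 = spam, 0 = ham
texts = ['free prize waiting, call now', 'see you at the meeting',
         'win cash today, reply to claim', 'lunch tomorrow at noon?']
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# TF-IDF vectorization followed by a linear SVM classifier
baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, baseline.predict(X_test)))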

Proposed Architecture

Now, we propose a different architecture. In the baseline models, we used CountVectorizer() and TF-IDF for the vectorization process. For our proposed system, we implemented GloVe and used the feature matrix obtained from the embedding layer for vectorization.

In case you are not familiar with GloVe: GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
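As a rough sketch, pre-trained GloVe vectors can be loaded into an embedding matrix like this; the file name glove.6B.100d.txt, the 100-dimensional size and the toy texts are assumptions on our part.

import numpy as np
from keras.preprocessing.text import Tokenizer

# Toy texts standing in for the real corpus; the tokenizer supplies the word index
texts = ['free prize waiting call now', 'see you at the meeting']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1

# Parse the pre-trained GloVe file (assumed to be downloaded locally)
embedding_dim = 100
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows of this matrix initialise the Embedding layer; unknown words stay as zero vectors
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

One common way to wire this in is to pass the matrix to the Keras Embedding layer via weights=[embedding_matrix] (often with trainable=False); that wiring is our assumption, not something stated in the original post.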

Proposed Architecture

The architecture has 2 phases:

  1. Feature Extraction: In this phase, an LSTM-based model is used to extract deep features. The model is first trained on the training data; after training, the output of the embedding layer is extracted as a feature vector for both the train and test data.

  2. Classification: The extracted deep features are then fed as input to standard machine learning models (such as SVM and Logistic Regression) to classify each message or email as ham or spam.

Following is the sample code for the implementation:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# Embedding layer whose output is later extracted as the feature matrix
model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=maxlen))
model.add(LSTM(units=32))
model.add(Dense(1, activation='sigmoid'))  # binary ham/spam output
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

As we can see, we are using only one Embedding layer (from which we extract the feature matrix), one LSTM layer and one Dense layer.
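Continuing from the model above, one plausible way to pull out the embedding-layer output and hand it to a classical classifier looks like this; the mean-pooling over the time axis and the LinearSVC choice are our own assumptions, and X_train, X_test, y_train, y_test stand for the padded sequences and labels from earlier.

from keras.models import Model
from sklearn.svm import LinearSVC

# Sub-model that stops at the embedding layer of the trained network above
feature_extractor = Model(inputs=model.input, outputs=model.layers[0].output)

# Embedding output has shape (num_messages, maxlen, 32); average over the time axis
# to obtain one fixed-length vector per message (one of several possible poolings)
train_features = feature_extractor.predict(X_train).mean(axis=1)
test_features = feature_extractor.predict(X_test).mean(axis=1)

# Phase 2: feed the deep features to a classical model such as an SVM
clf = LinearSVC()
clf.fit(train_features, y_train)
print('Accuracy:', clf.score(test_features, y_test))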

Results

We have performed experiments using our proposed feature extraction module with different machine learning models for both data sets, and the results are shown in Tables 3 and 4. We can observe that the SVM and Logistic Regression models complement the deep features extracted with the LSTM well on both datasets.

We have also done a baseline study for both the data sets using a few machine learning models with TF-IDF features and the results are shown in Tables 1 and 2.

We can observe that on the SMS dataset, Logistic Regression with LSTM features shows an increase in accuracy of around 4%, and SVM with LSTM features shows an increase of about 1%, compared with their respective baseline models.

On the Enron Email dataset, we can see that SVM with LSTM features shows an improvement of 1%, and Logistic Regression with LSTM features also improves over its baseline.

Conclusions

Email remains an easy and preferred way of communication for business organisations, even in the long term, while SMS refers to the message service used on mobile devices.

With the growth of technology and internet access, there has been a steady upsurge in the volume of spam emails and messages. The proposed method for filtering spam messages and emails therefore extracts deep features from an LSTM-based model and feeds these features as input to baseline machine learning models to predict whether a message or email is ham or spam.

Accuracy increased when Logistic Regression and Support Vector Machines were used as the machine learning models with the LSTM-based features as input, compared with their respective baseline accuracies on both the Email and SMS datasets. Both models therefore performed well when complemented with the extracted deep features.

Acknowledgements

We would like to thank our Professor Dr Tanmoy Chakraborty and our Teaching Fellow and Teaching Assistants for their support and guidance throughout the project.

Teaching Fellow: Ms. Ishita Bajaj

Teaching Assistants: Chhavi Jain, Pragya Srivastava, Shiv Kumar Gehlot, Vivek Reddy, Shikha Singh, and Nirav Diwan.

Social media profiles link of Dr Tanmoy Chakraborty:

LinkedIn: https://www.linkedin.com/in/tanmoy-chakraborty-89553324/

Twitter: @Tanmoy_Chak

Facebook: https://www.facebook.com/chak.tanmoy

#MachineLearning2020.

Individual Contributions

This project was a combined effort overall. So, we all collaborated as a team and co-ordinated throughout the project duration. Everyone participated in all parts of the project.

Below are the individual contributions:

Pruthivi Raj Behera: Literature review, Data pre-processing and analysis, Data Visualization, Implementation of the baseline models and proposed model on the Email dataset and Blog.

Roshan S: Literature review, Report writing, Data pre-processing, analysis and Data Visualization of the SMS dataset, Implementation of the proposed model on the SMS dataset, Idea and proposal.

Shreya Goel: Literature review, Report writing, Presentation, Data pre-processing and analysis, Data visualization and Implementation of baseline models on the SMS dataset, Analysis of results obtained from proposed models.

