A Comprehensive Approach to Image Spam Detection: From Server to Client Solution
Written by Nazanin Firoozeh   
Tuesday, 28 December 2010 15:52
Unlike other image spam filtering solutions, which use only client-side filtering, Gao, Choudhary, and Hua ("A Comprehensive Approach to Image Spam Detection: From Server to Client Solution") propose a method that adds server-side filtering to client-side filtering. The server-side filter handles a set of spam images at once instead of filtering them one by one. Image attachments are clustered using a nonnegative sparsity-induced similarity measure, and a spectral clustering algorithm is used, for the first time in image spam filtering, to obtain the clusters. The similarity measure assumes that an image belongs to a cluster if it can be reconstructed as a nonnegative linear combination of a small number of samples from the same cluster. Since larger clusters are more likely to contain spam images, they are analyzed further in order to find their source and block it. Smaller clusters are passed to the client side, where an active learning image spam hunter guides users to label a small number of spam images while keeping classification accuracy as high as possible. During classification, the classifier is updated as users label images, until the accuracy exceeds a predefined threshold. Two active learning classifiers were examined in this work: the first based on a support vector machine (SVM) and the second on a Gaussian process (GP) classifier.
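The server-side clustering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the solver (projected gradient descent), the penalty weight `lam`, the small k-means step, the function names, and the synthetic feature vectors are all assumptions made for illustration. Each sample is reconstructed as a sparse nonnegative combination of the other samples, the symmetrized weights serve as the similarity matrix, and a standard normalized spectral clustering is run on it.

```python
import numpy as np

def nonneg_sparse_codes(X, lam=0.05, n_iter=300):
    """For every row x_i find weights w >= 0 (with w_i = 0) that minimise
    0.5 * ||x_i - w @ X||^2 + lam * sum(w), using projected gradient descent."""
    n = X.shape[0]
    G = X @ X.T                                    # Gram matrix of the samples
    step = 1.0 / (np.linalg.norm(G, 2) + 1e-12)    # 1 / Lipschitz constant
    W = np.zeros((n, n))
    for _ in range(n_iter):
        grad = W @ G - G + lam                     # gradient for all rows at once
        W = np.maximum(W - step * grad, 0.0)       # project onto w >= 0
        np.fill_diagonal(W, 0.0)                   # a sample may not reconstruct itself
    return W

def spectral_labels(S, k, n_iter=20):
    """Normalised spectral clustering on a symmetric similarity matrix S."""
    d = S.sum(axis=1) + 1e-12
    L = np.eye(len(S)) - S / np.sqrt(np.outer(d, d))   # normalised Laplacian
    _, vecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    U = vecs[:, :k]                                    # k smallest eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # Small k-means on the embedded rows, with farthest-point initialisation.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for image feature vectors: two groups of six samples,
# each a noisy copy of a hypothetical template.
rng = np.random.default_rng(0)
base_a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
base_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
X = np.vstack([base_a + 0.02 * rng.random((6, 5)),
               base_b + 0.02 * rng.random((6, 5))])

W = nonneg_sparse_codes(X)
S = 0.5 * (W + W.T)            # symmetrise the weights into a similarity matrix
labels = spectral_labels(S, k=2)
print(labels)
```

In this toy setting the reconstruction weights concentrate on samples from the same group, so the similarity matrix is close to block diagonal and the spectral step recovers the two groups. Real image features would need tuning of `lam` and the number of clusters.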

To evaluate the proposed method, the authors used a dataset containing 1190 spam images and 1760 non-spam images. In addition, 23 features related to the color, text, shape, and appearance of images were extracted for the spam filtering task on both the server and the client side. These features can be extracted from an image in less than 10 ms. The results show that spam images have completely different visual statistics from non-spam ones, so these features can successfully separate spam from normal images.

To evaluate the similarity measure on the server side, two criteria were used: average clustering accuracy (CAC) and normalized mutual information between the clustering results and the true clusters (reported as µMI). The results indicate that the proposed method clusters well, with CAC = 0.635 and µMI = 0.734. Furthermore, the clustering results obtained with the proposed similarity measure are stable across different initializations. To evaluate the active learning classifier on the client side, the false positive rate (FPR) and true positive rate (TPR) were calculated. The results show that active learning, and active learning SVM in particular, needs only a few labeled images to reach a recognition accuracy above 0.99. Overall, active learning SVM performed better than the active learning GP classifier in this experiment.
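The client-side loop and the TPR/FPR measures mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's classifier: a plain linear SVM trained by subgradient descent stands in for the active learning SVM, the query rule is simple uncertainty sampling (label the image closest to the decision boundary), accuracy is measured on a hypothetical labeled validation set, and the 2-D synthetic features replace the 23 image features. All function names and parameters are made up for the example.

```python
import numpy as np

def train_linear_svm(X, y, reg=0.01, lr=0.1, epochs=200):
    """Linear SVM (hinge loss + L2 penalty) via subgradient descent; y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                 # samples inside the margin
        grad_w = reg * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
        grad_b = -y[viol].sum() / len(y)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def predict(X, w, b):
    return np.where(X @ w + b >= 0, 1.0, -1.0)

def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate, with spam encoded as +1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    return tp / np.sum(y_true == 1), fp / np.sum(y_true == -1)

def active_learn(X_pool, oracle, X_val, y_val, target=0.99, max_queries=18):
    """Query the most uncertain image until validation accuracy reaches target.
    `oracle` plays the role of the user who supplies a label on request."""
    # Seed with one labeled example of each class.
    labeled = [int(np.argmax(oracle == 1)), int(np.argmax(oracle == -1))]
    while True:
        w, b = train_linear_svm(X_pool[labeled], oracle[labeled])
        acc = np.mean(predict(X_val, w, b) == y_val)
        if acc >= target or len(labeled) >= 2 + max_queries:
            return w, b, labeled, acc
        margin = np.abs(X_pool @ w + b)
        margin[labeled] = np.inf                   # never re-query a labeled image
        labeled.append(int(np.argmin(margin)))     # closest to the boundary

# Synthetic two-class data standing in for spam (+1) and non-spam (-1) features.
rng = np.random.default_rng(0)

def make_data(n):
    y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
    X = 2.0 * y[:, None] + 0.6 * rng.standard_normal((n, 2))
    return X, y

X_pool, y_pool = make_data(200)
X_val, y_val = make_data(200)

w, b, labeled, acc = active_learn(X_pool, y_pool, X_val, y_val)
tpr, fpr = tpr_fpr(y_val, predict(X_val, w, b))
print(len(labeled), acc, tpr, fpr)
```

The loop stops either when the accuracy threshold is met or when the labeling budget (`max_queries`) is exhausted, which mirrors the idea of asking the user for as few labels as possible.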

As future work, IP tracing techniques could be combined with the proposed server-side system to detect spammers' e-mail addresses or IPs. UI/UX designs could also be explored for integrating the client-side active learning classifier with mainstream e-mail clients. Finally, more discriminative features could be extracted from images to improve the clustering and, in turn, the overall performance of the proposed method.

References:

Yan Gao, Alok Choudhary, Gang Hua. A Comprehensive Approach to Image Spam Detection: From Server to Client Solution. IEEE Transactions on Information Forensics and Security.
 
