Unlike other image spam filtering solutions, which rely only on client-side filtering, Gao, Choudhary, and Hua [A Comprehensive Approach to Image Spam Detection: From Server to Client Solution] proposed a method that uses server-side filtering in addition to client-side filtering. The server-side stage filters image spam in batches rather than one image at a time. In this stage, image attachments are clustered using a nonnegative sparsity-induced similarity measure, and a spectral clustering algorithm is applied, for the first time in image spam filtering, to obtain the clusters. The similarity measure assumes that an image belongs to a cluster if it can be reconstructed as a nonnegative linear combination of a small number of samples from the same cluster. Because larger clusters are more likely to contain spam images, they are analyzed further to identify their source and block it. Smaller clusters are passed to the client side, where an active learning image spam hunter guides users to label a small number of spam images while keeping classification accuracy as high as possible. During classification, the classifier is updated with the users' labels until its accuracy exceeds a predefined threshold. Two active learning classifiers were examined in this work: the first based on a support vector machine (SVM) and the second on a Gaussian process (GP) classifier.
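The reconstruction assumption behind the similarity measure can be illustrated with a small sketch. It is not the authors' implementation: the solver (projected gradient descent), the penalty weight `lam`, the iteration count `steps`, and all function names are assumptions chosen for illustration. Each sample is coded as a nonnegative, sparsity-penalized combination of the other samples, and the symmetrized weights serve as pairwise similarities. The spectral clustering step that would consume this matrix is omitted.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nonneg_sparse_code(x, atoms, lam=0.1, steps=500):
    """Approximately minimize 0.5*||x - sum_j w[j]*atoms[j]||^2 + lam*sum(w)
    subject to w >= 0, using projected gradient descent (an assumed solver)."""
    n = len(atoms)
    gram = [[dot(a, b) for b in atoms] for a in atoms]
    corr = [dot(a, x) for a in atoms]
    # The trace of the Gram matrix bounds its largest eigenvalue,
    # so 1/trace is a safe (if conservative) step size.
    trace = sum(gram[j][j] for j in range(n))
    step = 1.0 / trace if trace > 0 else 1.0
    w = [0.0] * n
    for _ in range(steps):
        grad = [dot(gram[j], w) - corr[j] + lam for j in range(n)]
        # Gradient step followed by projection onto the nonnegative orthant.
        w = [max(0.0, w[j] - step * grad[j]) for j in range(n)]
    return w


def similarity_matrix(samples, lam=0.1, steps=500):
    """Code every sample with all the others and symmetrize the weights."""
    n = len(samples)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        atoms = [samples[j] for j in others]
        w = nonneg_sparse_code(samples[i], atoms, lam, steps)
        for j, w_j in zip(others, w):
            weights[i][j] = w_j
    return [[0.5 * (weights[i][j] + weights[j][i]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy feature vectors (hypothetical): two groups living in different subspaces.
    samples = [
        [1.0, 0.2, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.8, 0.3, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.1], [0.0, 0.0, 0.9, 0.2], [0.0, 0.0, 1.1, 0.0],
    ]
    for row in similarity_matrix(samples):
        print(" ".join(f"{v:.2f}" for v in row))
```

On this toy data the two groups occupy orthogonal subspaces, so samples only receive weight from members of their own group. This block structure is what a spectral clustering routine would then exploit.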
To evaluate the proposed method, the authors used a dataset containing 1190 spam images and 1760 non-spam images. In addition, 23 features related to the color, text, shape, and appearance of images were extracted for the spam filtering task on both the server and client sides. These features can be extracted from an image in less than 10 ms. The results show that spam images have completely different visual statistics from non-spam ones, so these features can successfully separate spam from normal images.
To evaluate the similarity measure on the server side, two criteria were used: average clustering accuracy (CAC) and normalized mutual information between the clustering results and the true clusters. The results indicate that the proposed method clusters well, with CAC=0.635 and µMI=0.734. Furthermore, the clustering results obtained with the proposed similarity measure are stable across different initializations. To evaluate the active learning classifier on the client side, the false positive rate (FPR) and true positive rate (TPR) were calculated. The results show that active learning, and the active learning SVM in particular, needs only a few labeled images to reach a recognition accuracy above 0.99. Overall, the active learning SVM outperforms the active learning GP classifier in this experiment.
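For readers unfamiliar with these criteria, the following sketch shows one common way to compute normalized mutual information and the TPR/FPR pair. The normalization used here (mutual information divided by the geometric mean of the two entropies) is an assumption, since the exact variant used by the authors is not stated in this summary. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((n_ij / n) * math.log(n * n_ij / (count_a[i] * count_b[j]))
               for (i, j), n_ij in joint.items())


def normalized_mutual_information(a, b):
    """MI divided by sqrt(H(a) * H(b)); this normalization is an assumption."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one assignment has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def tpr_fpr(y_true, y_pred, positive=1):
    """True positive rate and false positive rate for a binary labeling."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Cluster ids are arbitrary, so a relabeled but identical partition scores 1.
    print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
    # 1 = spam, 0 = non-spam (hypothetical labels).
    print(tpr_fpr([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0]))
```

Normalized mutual information is invariant to how clusters are numbered, which makes it suitable for comparing a clustering against ground-truth groups. TPR and FPR together describe how many spam images are caught and how many legitimate images are wrongly flagged.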
As future work, IP tracing techniques could be combined with the proposed server-side system to detect spammers' e-mail addresses or IP addresses. Moreover, UI/UX designs could be explored for integrating the client-side active learning classifier into mainstream e-mail clients. Finally, more discriminative features could be extracted from images to improve clustering performance and, in turn, the performance of the overall method.
Yan Gao, Alok Choudhary, Gang Hua. A Comprehensive Approach to Image Spam Detection: From Server to Client Solution. IEEE Transactions on Information Forensics and Security.