TR2010-064

Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier


    •  Yerazunis, W.S.; Kato, M.; Kori, M.; Shibata, H.; Hackenberg, K., "Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier", Black Hat Technical Security Conference, July 2010.
      @inproceedings{Yerazunis2010jul,
        author = {Yerazunis, W.S. and Kato, M. and Kori, M. and Shibata, H. and Hackenberg, K.},
        title = {Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier},
        booktitle = {Black Hat Technical Security Conference},
        year = 2010,
        month = jul,
        url = {http://www.merl.com/publications/TR2010-064}
      }
  • Research Area: Data Analytics


In this whitepaper we consider the problem of outbound filtering of email to prevent accidental leakage of confidential information. We examine how to do this with the GPLed open-source spam filter CRM114, and we test the accuracy of this filter against a corpus of more than 10,000 hand-classified emails (both confidential and non-confidential) in Japanese. We look at the moving parts involved in these filters and how they can be set up. The results show that a hybrid of multiple CRM114 filters outperforms a human-crafted regular-expression filter by nearly 100x in recall, detecting more than 99.9% of confidential documents with a simultaneous false alarm rate of less than 6%. Because the programmers who created the machine-learning programs cannot read or write Japanese, this problem is an almost ideal case of Searle's "Chinese Room" problem.
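
To make the two reported metrics concrete, the following is a minimal sketch in Python of how recall and false alarm rate could be computed for a hybrid of several filters. It rests on assumptions not stated in the abstract: the hybrid is modeled as a simple OR over the individual filters, the false alarm rate is taken as the fraction of non-confidential documents that get flagged, and the keyword filters and four-document corpus are hypothetical stand-ins, not CRM114 classifiers or the paper's data.

```python
from typing import Callable, Iterable, List, Tuple

# A filter returns True when it flags a document as confidential.
Filter = Callable[[str], bool]


def hybrid_flag(filters: Iterable[Filter], text: str) -> bool:
    """Flag a document if ANY filter flags it (assumed OR-combination)."""
    return any(f(text) for f in filters)


def recall_and_false_alarm(
    predictions: List[bool], labels: List[bool]
) -> Tuple[float, float]:
    """Return (recall, false alarm rate).

    recall           = flagged confidential / all confidential
    false alarm rate = flagged non-confidential / all non-confidential
    """
    tp = fn = fp = tn = 0
    for predicted, actual in zip(predictions, labels):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_alarm


# Hypothetical stand-in filters (simple keyword checks, NOT CRM114).
toy_filters: List[Filter] = [
    lambda text: "internal only" in text,
    lambda text: "budget" in text,
]

# Hypothetical toy corpus: (document text, is_confidential).
toy_corpus = [
    ("project falcon budget draft", True),
    ("lunch at noon?", False),
    ("internal only: merger terms", True),
    ("budget-friendly lunch spots", False),
]

predictions = [hybrid_flag(toy_filters, text) for text, _ in toy_corpus]
labels = [is_confidential for _, is_confidential in toy_corpus]
recall, false_alarm = recall_and_false_alarm(predictions, labels)
print(f"recall={recall:.3f} false_alarm_rate={false_alarm:.3f}")
```

An OR-combination of this kind trades a higher false alarm rate for higher recall, which suits an outbound firewall where a missed confidential document costs more than a needlessly held message.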