Mitsubishi Electric Research Laboratories

Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier

Citation: Yerazunis, W.S.; Kato, M.; Kori, M.; Shibata, H.; Hackenberg, K., "Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier", Black Hat Technical Security Conference, July 2010 (Black Hat USA 2010)
MERL Report: TR2010-064

In this whitepaper we consider the problem of outbound filtering of email to prevent accidental leakage of confidential information. We examine how to do this with the GPL-licensed open-source spam filter CRM114, and test its accuracy against a corpus of more than 10,000 hand-classified emails in Japanese (both confidential and non-confidential). We look at the moving parts involved in these filters and how they can be set up. The results show that a hybrid of multiple CRM114 filters outperforms a human-crafted regular-expression filter by nearly 100x in recall, detecting > 99.9% of confidential documents with a simultaneous false alarm rate of less than 6%. Because the programmers who created the machine-learning programs cannot read or write Japanese, this problem is an almost ideal case of Searle's "Chinese Room" problem.
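
To make the two reported metrics concrete, the following is a minimal Python sketch of how recall and false alarm rate could be computed for a hybrid of filters. It is an illustration only, not the report's implementation: the OR-style combination rule, the function names `hybrid` and `recall_and_false_alarm`, the keyword stand-in filters, and the four toy documents are all assumptions introduced here, and no CRM114 call is involved.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier returns True when it flags a document as confidential.
Classifier = Callable[[str], bool]


def hybrid(filters: List[Classifier]) -> Classifier:
    """Combine filters with an OR rule (an assumption, not the report's rule):
    a document is flagged if any component filter flags it."""
    return lambda text: any(f(text) for f in filters)


def recall_and_false_alarm(
    clf: Classifier, labeled: Iterable[Tuple[str, bool]]
) -> Tuple[float, float]:
    """Return (recall, false alarm rate) over (text, is_confidential) pairs."""
    tp = fn = fp = tn = 0
    for text, is_confidential in labeled:
        flagged = clf(text)
        if is_confidential:
            if flagged:
                tp += 1  # confidential and caught
            else:
                fn += 1  # confidential but leaked
        else:
            if flagged:
                fp += 1  # harmless but blocked (false alarm)
            else:
                tn += 1  # harmless and passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_alarm


# Toy stand-ins for trained filters and a hand-classified corpus (hypothetical).
def mentions_roadmap(text: str) -> bool:
    return "roadmap" in text


def mentions_salary(text: str) -> bool:
    return "salary" in text


toy_corpus = [
    ("internal roadmap draft", True),
    ("salary table attached", True),
    ("lunch at noon?", False),
    ("roadmap for the hiking trip", False),
]

combined = hybrid([mentions_roadmap, mentions_salary])
print(recall_and_false_alarm(mentions_roadmap, toy_corpus))  # (0.5, 0.5)
print(recall_and_false_alarm(combined, toy_corpus))          # (1.0, 0.5)
```

In this toy setup the OR rule raises recall because each filter catches documents the other misses, while the false alarm rate can only stay the same or grow. That trade-off is why both figures need to be reported together.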
