An Automatic Training Data Collection Method for Confidential E-Mail Detection

In this paper, we propose a method for automatically operating a confidential e-mail detection system that combines machine learning and keyword search. The recent information explosion has increased the need for technology that can detect confidential information in electronic data. Methods based on machine learning are one way to achieve high accuracy. However, manually preparing a large amount of correctly labeled training data is difficult, and this often becomes an obstacle in practice. We restrict our attention to e-mail and present a method that collects training data automatically using domain information. This allows the confidential e-mail detection system to operate automatically. We also demonstrate the effectiveness of our method through an implementation and evaluation on an e-mail archive system.