Academic Journals Database
Disseminating quality controlled scientific knowledge

Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach

Author(s): Ahmed Irfan | Lhee Kyung-suk | Shin Hyunjung | Hong ManPyo

Journal: IETE Technical Review
ISSN 0256-4602

Volume: 27;
Issue: 6;
Start page: 465;
Date: 2010;

Keywords: Byte frequency distribution | Cosine similarity | Clustering | File type identification | Mahalanobis distance | Neural network

Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach is to use file extensions, magic numbers, or other header information. However, these are susceptible to tampering or corruption; for instance, a file extension can be easily spoofed and magic numbers can be obfuscated. A more reliable approach may be to analyze the file content itself instead of relying only on this metadata. This paper proposes two methods based on the file content. First, we use cosine similarity as the metric when comparing file content, rather than the Mahalanobis distance that is popular and has been used by other related approaches. Unlike the Mahalanobis distance, cosine similarity retains its classification accuracy when only a small number of highly frequent byte patterns are used, which leads to a smaller model size and faster detection. Second, we decompose the identification procedure into two steps using a divide-and-conquer approach: in the first step, files that are similar in terms of byte pattern frequencies are grouped into several clusters; in the second step, any cluster that contains different file types is fed to a neural network for finer classification. The experiments showed that clustering followed by classification leads to higher accuracy.
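
As a rough illustration of the first idea, the sketch below compares byte frequency distributions with cosine similarity, restricted to the most frequent byte values of each file-type model. It is not the authors' implementation: the function names, the toy training data, and the cutoff `k` are assumptions made for this example, and single-byte frequencies stand in for the "byte patterns" mentioned in the abstract.

```python
import math
from collections import Counter


def byte_frequency(data):
    """Return the normalized 256-bin byte frequency distribution of `data`."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_bytes(model, k):
    """Indices of the k most frequent byte values in a file-type model."""
    return sorted(range(256), key=lambda b: model[b], reverse=True)[:k]


def similarity_on_top_k(sample, model, k):
    """Compare sample and model only on the model's k most frequent bytes."""
    idx = top_k_bytes(model, k)
    return cosine_similarity([sample[i] for i in idx], [model[i] for i in idx])


# Toy "models" (hypothetical data): one text-like, one with uniform byte values.
models = {
    "text": byte_frequency(b"the quick brown fox jumps over the lazy dog " * 50),
    "binary": byte_frequency(bytes(range(256)) * 10),
}

unknown = byte_frequency(b"a lazy dog sleeps while the quick fox runs " * 20)

# k=20 is an arbitrary choice for this sketch, not a value from the paper.
scores = {name: similarity_on_top_k(unknown, m, k=20) for name, m in models.items()}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Because each model is reduced to its `k` most frequent byte values, only `k` numbers per file type need to be stored and compared, which is the intuition behind the smaller model size and faster detection described above.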