Malware detector and classifier based on static analysis of PE executables

View the Project on GitHub deut-erium/Mal-det-cal

Malware Detection cum Classifier

Detects and classifies malware from static and dynamic analysis of a PE executable

Generating the required structure

To test a single PE executable, or all PE executables under a path, run
python3 file <path/to/file> <path/to/savefile>

python3 path <path/to/file> <path/to/savefile>
This checks all files recursively under the path, produces the required directory structure in <path/to/savefile>, and writes a list of (filepath, sha256) pairs to files_info.txt.
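As a rough illustration of the (filepath, sha256) listing, the sketch below walks a directory recursively and hashes every file. The function names and the tab-separated layout of the output file are assumptions made for this example, not necessarily what the project's script does.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_file_info(root):
    """Walk root recursively and return sorted (filepath, sha256) pairs."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            pairs.append((full_path, sha256_of_file(full_path)))
    return sorted(pairs)


def write_file_info(pairs, out_path):
    """Write one "filepath<TAB>sha256" line per file (assumed layout)."""
    with open(out_path, "w") as out:
        for filepath, digest in pairs:
            out.write(f"{filepath}\t{digest}\n")
```

Reading in chunks keeps memory use flat even for large executables.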

Run python3 <path/to/savefile> to check the files

Requirements to run

Tested on Python 3.8. Use requirements.txt to install the required packages.
About 1 GB of free space is also required in the directory from which the code is run.

Submission Structure

Usage info: python3 <path to testing data>
Uses the model trained_model.pickle.
Output is stored in output.csv. An intermediate CSV, temp_features.csv, is generated containing the selected features.

Assumptions on test directory

The test directory must meet the following requirements:
Static analysis data is stored as String.txt and Structure_Info.txt under a directory named SHA256(file_name), somewhere in the supplied path.
Dynamic analysis data is stored as SHA256(file_name).json, somewhere in the supplied path.

   ├── String.txt
   └── Structure_Info.txt
└── 0a2adcac2b16b02d475e9d47b4772b77b0b4269132f07557c7ef6081727585da.json
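A minimal sketch of how such a layout could be discovered is shown below. It assumes hashes are written as 64 lowercase hex characters, as in the example above. The function name `find_samples` and the keys of the returned dictionary are hypothetical.

```python
import os
import re

# Assumption: SHA-256 names are 64 lowercase hex characters.
SHA256_NAME = re.compile(r"[0-9a-f]{64}")


def find_samples(root):
    """Map each SHA-256 to the static directory and dynamic JSON found for it."""
    samples = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        folder = os.path.basename(dirpath)
        # Static analysis: a directory named after the hash holding both text files.
        has_static = {"String.txt", "Structure_Info.txt"} <= set(filenames)
        if SHA256_NAME.fullmatch(folder) and has_static:
            samples.setdefault(folder, {})["static_dir"] = dirpath
        # Dynamic analysis: a file named <hash>.json anywhere under root.
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext == ".json" and SHA256_NAME.fullmatch(stem):
                samples.setdefault(stem, {})["dynamic_json"] = os.path.join(dirpath, name)
    return samples
```

A sample with only one of the two keys has incomplete analysis data, which a caller may want to report or skip.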

The file used to train the model. The training is done in two phases:

  1. Data collection and feature selection: the dynamic and static analysis data is parsed for all files and stored temporarily in filtered_data under the filename <hash>_filtered.json.
    If the training data has not been filtered already:
    Usage: python3 filter <path to training data>

    NOTE: The training data is assumed to contain the directory “Static_Analysis_Data”.

  2. Model training, once the training data has been filtered into filtered_data:
Usage: python3
Generates training_data.csv and trained_model.csv

Extract info in required format from a file or from all files in path


Compressed filtered_data, use 7z x filtered.7z -ofiltered_data to create filtered_data directory


Trained model generated after training from filtered_data


Output csv containing filename, class pairs

Generated on running


Temporary selected features of test_data are stored

Generated on running


Filtered training data is stored in this directory used for further training of model

7z x filtered.7z -ofiltered_data


Generates an intermediate csv of selected features from filtered_data

Generated on running


Details of packages installed
pip3 install -r requirements.txt

Cleans the temporary results in filtered_data and temp_features.csv.
Usage: python3


Temporary directory generated on running which contains the filtered test files

Generated on running

Feature Extraction (Pre-Processing)

Since the training data is huge and contains a lot of unnecessary information, both the training data and the test data are pre-processed into an intermediate filtered_data. We parse the data to extract the following values:

Static Analysis Data


parsed_data: Dictionary with following features


filtered_data: A dictionary with selected fields namely

Dynamic Analysis Data

Feature Selection

Section Entropy

Each PE section contains data that usually has a known entropy. Higher entropy can indicate packed data. Malicious files are commonly packed to avoid static analysis: the actual code is usually stored encrypted in one of the sections and is only extracted at runtime.
Entropy is therefore an important feature. We use the average, minimum and maximum section entropy as separate features.
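For reference, the entropy in question is Shannon entropy over byte values. The sketch below computes it from raw bytes and then summarises a list of per-section values. The function names and the feature keys (`entropy_avg`, `entropy_min`, `entropy_max`) are hypothetical, and the project itself may read precomputed entropies from the analysis data instead.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def entropy_features(section_entropies):
    """Summarise per-section entropies as average, minimum and maximum."""
    if not section_entropies:
        return {"entropy_avg": 0.0, "entropy_min": 0.0, "entropy_max": 0.0}
    return {
        "entropy_avg": sum(section_entropies) / len(section_entropies),
        "entropy_min": min(section_entropies),
        "entropy_max": max(section_entropies),
    }
```

Uniformly random or encrypted bytes approach 8 bits per byte, while a section filled with a single repeated byte has entropy 0.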

Number of Strings (num_strings)

The number of strings in the String.txt file is an important feature because most malware is packed and therefore contains fewer strings than a benign executable that is not packed.
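A minimal sketch of this count, assuming the strings dump holds one extracted string per line (the actual file layout may differ), could look like this. The function name `count_strings` is hypothetical.

```python
def count_strings(path):
    """Count non-empty lines in a strings dump, assuming one string per line."""
    with open(path, "r", errors="ignore") as handle:
        return sum(1 for line in handle if line.strip())
```

Decoding errors are ignored because dumps of binary files often contain bytes that are not valid text.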

Size of file (Length)

Since most malware is packed, its size is generally smaller than that of a benign file. In the plot, we can see that most benign executables are larger.

Packer present

Malware files are usually packed with common packers such as UPX or ASPack. The packer can be identified from the file header, where its signature is present.
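One simple heuristic, shown here only as an illustration, is to look for section names that well-known packers leave behind (UPX typically creates sections named UPX0 and UPX1, and ASPack typically adds .aspack and .adata). This is a swapped-in technique and not necessarily the signature check this project uses. The table and the function name `detect_packers` are hypothetical.

```python
# Section names commonly left behind by two well-known packers.
# This small table is illustrative, not exhaustive.
KNOWN_PACKER_SECTIONS = {
    "upx0": "UPX",
    "upx1": "UPX",
    ".aspack": "ASPack",
    ".adata": "ASPack",
}


def detect_packers(section_names):
    """Return a sorted list of packers suggested by the given section names."""
    found = set()
    for raw_name in section_names:
        # PE section names are padded with NUL bytes up to 8 characters.
        name = raw_name.strip("\x00 ").lower()
        if name in KNOWN_PACKER_SECTIONS:
            found.add(KNOWN_PACKER_SECTIONS[name])
    return sorted(found)
```

For example, `detect_packers(["UPX0", "UPX1", ".rsrc"])` reports UPX, while an ordinary layout such as `[".text", ".data"]` yields an empty list. Section names can be renamed by an attacker, so a feature like this is a hint, not proof.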

UDP Destination Address

Most malware tries to connect to a remote server, either to transfer data or to establish a reverse shell. The number of UDP destination addresses can effectively differentiate malware from a benign executable, since malware is likely to make more UDP calls.
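The sketch below counts distinct UDP destination addresses in a parsed dynamic-analysis report. It assumes a sandbox-style layout in which the report has a `network` object containing a `udp` list whose entries carry a `dst` field. That layout, the choice to count distinct rather than total addresses, and the function name are all assumptions made for this example.

```python
def count_udp_destinations(report):
    """Count distinct UDP destination addresses in a parsed JSON report.

    Assumes report["network"]["udp"] is a list of dicts with a "dst" key.
    Missing or malformed parts of the report count as zero destinations.
    """
    network = report.get("network") or {}
    packets = network.get("udp") or []
    destinations = {
        packet["dst"]
        for packet in packets
        if isinstance(packet, dict) and "dst" in packet
    }
    return len(destinations)
```

The report itself would be loaded with `json.load` from the SHA256(file_name).json file before being passed in.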


In the training phase, most of the time is spent filtering features: it takes about two hours to parse and generate intermediate files for about 10000 training files.
Once filtered_data is generated (it has been provided in the package), training a model takes about 2 minutes.

We chose a DecisionTree Classifier as the base estimator for a Bagging Classifier. The classifier distinguishes between the classes Benign, Trojan, Virus, Worm, Trojandownloader, Trojandropper and Backdoor with a training accuracy of 96%. Collapsing the classes to just Benign and Malware produces a very good classifier, with almost 100% precision, recall and F-score even with a 60/40 training/testing split.

BAG SCORE: 0.9798792756539235
PRECISION: [1. 1.]
RECALL: [1. 1.]
F-SCORE: [1. 1.]
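To make the Benign versus Malware evaluation concrete, the sketch below collapses multi-class labels into two classes and computes precision, recall and F-score for one of them in plain Python. It is not the project's evaluation code, the labels in the usage note are made up for illustration, and the function names are hypothetical.

```python
def to_binary(label):
    """Collapse every non-benign class (Trojan, Worm, ...) into "Malware"."""
    return "Benign" if label.lower() == "benign" else "Malware"


def precision_recall_fscore(y_true, y_pred, positive):
    """Precision, recall and F1 for the class named by positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```

Mapping both the true and the predicted labels through `to_binary` before scoring means that confusing one malware family with another (say, Trojan predicted as Worm) no longer counts as an error, which is one reason the two-class scores are higher than the multi-class accuracy.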