Malware Detection Using Machine Learning

Sanjeeva Rao Palla
7 min read · Dec 31, 2020

Objective: Identify whether a given file or piece of software is malware.

What is Malware?

The term malware is a contraction of malicious software. Put simply, malware is any piece of software that was written with the intent of doing harm to data, devices or to people.

Malware refers to malicious software that perpetrators dispatch to infect individual computers or an entire organization’s network. It exploits vulnerabilities in the target system, such as a bug in legitimate software (e.g., a browser or web application plugin) that can be hijacked.

There are many types of malware: viruses, Trojans, spyware, ransomware, and more.

What does malware do?

All kinds of things. It’s a very broad category, and what malware does or how malware works changes from file to file. The following is a list of common types of malware, but it’s hardly exhaustive:

  • Virus: Like their biological namesakes, viruses attach themselves to clean files and infect other clean files. They can spread uncontrollably, damaging a system’s core functionality and deleting or corrupting files. They usually appear as an executable file (.exe).
  • Trojans: This kind of malware disguises itself as legitimate software, or is hidden in legitimate software that has been tampered with. It tends to act discreetly and create backdoors in your security to let other malware in.
  • Spyware: No surprise here — spyware is malware designed to spy on you. It hides in the background and takes notes on what you do online, including your passwords, credit card numbers, surfing habits, and more.
  • Worms: Worms infect entire networks of devices, either local or across the internet, by using network interfaces. They use each newly infected machine to infect others.
  • Ransomware: This kind of malware typically locks down your computer and your files, and threatens to erase everything unless you pay a ransom.
  • Adware: Though not always malicious in nature, aggressive advertising software can undermine your security just to serve you ads — which can give other malware an easy way in. Plus, let’s face it: pop-ups are really annoying.
  • Botnets: Botnets are networks of infected computers that are made to work together under the control of an attacker.

Problem Statement

In the past few years, the malware industry has grown so rapidly that syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and stop these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.

Data

1. Source of data

Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the samples and identify their respective families.

This dataset provided by Microsoft contains 9 classes of malware.
Source:

2. Data Overview

For every malware sample, we have two files:
1. .asm file (read more: https://www.reviversoft.com/file-extensions/asm)
2. .bytes file (the raw data contains the hexadecimal representation of the file’s binary content, without the PE header)

The total training dataset is about 200 GB, of which 50 GB is .bytes files and 150 GB is .asm files. There are 10,868 .bytes files and 10,868 .asm files, 21,736 files in total.

There are 9 types of malware (9 classes) in the given data:
1. Ramnit
2. Lollipop
3. Kelihos_ver3
4. Vundo
5. Simda
6. Tracur
7. Kelihos_ver1
8. Obfuscator.ACY
9. Gatak

Fig : Example data point of .asm file
Fig : Example data point of .byte file

Exploratory Data Analysis

Let’s plot a graph to see the distribution of malware classes in the whole dataset:

Fig : Distribution of malware classes in whole data set
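
The counts behind a plot like this can be tallied with a few lines of Python. This is only a sketch: the function name `class_distribution` and the toy labels are made up for illustration, and the real labels would come from the dataset's label file.

```python
from collections import Counter

def class_distribution(labels):
    """Return (class, count, percent) rows sorted by descending count."""
    counts = Counter(labels)
    total = len(labels)
    return [(cls, n, round(100 * n / total, 3)) for cls, n in counts.most_common()]

# Made-up toy labels, not taken from the dataset.
labels = [3, 3, 3, 2, 2, 1, 3, 2, 5, 1]
for cls, n, pct in class_distribution(labels):
    print(f"Number of data points in class {cls} : {n} ( {pct} %)")
```

The same rows can be passed to any plotting library to draw the bar chart.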

Feature Extraction

Byte Files

Let’s extract features from the byte files using a unigram bag of words, which gives 258 features. Each unique hex code is a feature, and its count in the file is the value. The size of the byte file is one more feature, for a total of 259 features per byte file.

Fig : byte file features and data
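
A minimal sketch of the unigram counting step is shown here. It assumes each line of a .bytes file starts with an address followed by two-digit hex tokens, and that unreadable bytes appear as "??". The function name and the 257-token vocabulary are assumptions for illustration, so the column layout may differ from the 258 features used in this project.

```python
from collections import Counter

# Assumed vocabulary: the 256 two-digit hex codes plus "??" for unreadable bytes.
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]

def byte_unigram_features(lines):
    """Count every token in the lines of a .bytes file, skipping the address."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # tokens[0] is assumed to be the address column, so it is not counted.
        counts.update(tok.upper() for tok in tokens[1:])
    return [counts.get(tok, 0) for tok in VOCAB]

# Two made-up lines in the assumed format.
sample = [
    "00401000 56 8D 44 24 08 50 8B F1",
    "00401008 E8 1C 1B 00 00 ?? ?? 56",
]
features = byte_unigram_features(sample)
```

The file size feature can be appended to this vector separately, for example with `os.path.getsize`.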

ASM Files

The asm files contain:
1. Address
2. Segments
3. Opcodes
4. Registers
5. Function calls
6. APIs

Let’s extract features from the asm files in the same way, using a unigram bag of words. I extracted 52 features from all the asm files. Each unique address, segment, opcode, register, function call and API is a feature, and its count is the value. The size of the asm file is one more feature, for a total of 53 features per asm file.

Fig : asm file features and data
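
The counting idea for asm files can be sketched as follows. The vocabularies are hypothetical and shortened, since the 52 features used in this project are not listed here, and the line layout (a segment prefix, a colon, then address, bytes and instruction) is an assumption about the disassembly format.

```python
import re
from collections import Counter

# Hypothetical, shortened vocabularies for illustration only.
SEGMENTS = [".text", ".data", ".rdata", ".idata", ".rsrc"]
OPCODES = ["mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "retn"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]

def asm_unigram_features(lines):
    """Count segment prefixes, opcodes and registers over the lines of an .asm file."""
    counts = Counter()
    for line in lines:
        code = line.split(";", 1)[0]              # drop any trailing comment
        segment, sep, rest = code.partition(":")  # assumed "segment:rest" layout
        if sep and segment.strip() in SEGMENTS:
            counts[segment.strip()] += 1
        # Whole alphanumeric runs only, so "sub_402024" is not counted as "sub".
        for token in re.findall(r"[a-z0-9_]+", rest.lower()):
            if token in OPCODES or token in REGISTERS:
                counts[token] += 1
    return {name: counts.get(name, 0) for name in SEGMENTS + OPCODES + REGISTERS}

# Made-up lines in the assumed format.
sample = [
    ".text:00401000 56                 push    esi",
    ".text:00401001 8B F1              mov     esi, ecx",
    ".text:00401003 E8 1C 1B 00 00     call    sub_402024 ; comment with mov",
    ".data:00403000 00 00 00 00        dd 0",
]
feats = asm_unigram_features(sample)
```

Lines without a segment prefix are ignored in this sketch.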

Advanced Features

I extracted bi-gram features from the byte files and used the top 200 bi-gram features for model building.

Fig : byte file bi-gram features
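
A byte bi-gram is a pair of adjacent hex tokens. The sketch below counts bi-grams per file and keeps the k most frequent ones across all files. Ranking by raw frequency is an assumption, since the criterion used to pick the top 200 is not stated, and the function names are illustrative.

```python
from collections import Counter

def byte_bigrams(tokens):
    """Count adjacent token pairs, e.g. ('56', '8D')."""
    return Counter(zip(tokens, tokens[1:]))

def top_k_bigrams(per_file_tokens, k):
    """Return the k bi-grams with the highest total count over all files."""
    total = Counter()
    for tokens in per_file_tokens:
        total.update(byte_bigrams(tokens))
    return [bigram for bigram, _ in total.most_common(k)]

def bigram_vector(tokens, selected):
    """Feature vector for one file: the count of each selected bi-gram."""
    counts = byte_bigrams(tokens)
    return [counts.get(bigram, 0) for bigram in selected]

# Made-up token sequences for two files.
files = [["56", "8D", "56", "8D", "44"], ["56", "8D", "00", "00"]]
selected = top_k_bigrams(files, 1)
vectors = [bigram_vector(tokens, selected) for tokens in files]
```

With k set to 200, every file is represented by a 200-column count vector.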

I have extracted bi-grams, tri-grams, tetra-grams and image features from asm files.

Fig : asm file bi-gram features
Fig : asm file tri-gram features
Fig : asm file tetra-gram features
Fig : asm file image features
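
The same n-gram idea extends to opcode sequences, and an image feature can be built by treating each byte of the file as one grayscale pixel. Both functions below are sketches under assumptions. The opcode sequence is made up, and keeping the first `n_pixels` bytes is one common way to build image features, not necessarily the method used for the figures above.

```python
from collections import Counter

def ngrams(sequence, n):
    """Count every run of n consecutive items in the sequence."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

def image_features(raw, n_pixels):
    """Treat each byte as a pixel intensity (0-255); keep the first n_pixels, zero-padded."""
    pixels = list(raw[:n_pixels])
    return pixels + [0] * (n_pixels - len(pixels))

# Made-up opcode sequence from one asm file.
opcodes = ["push", "mov", "call", "push", "mov", "call", "retn"]
trigrams = ngrams(opcodes, 3)
tetragrams = ngrams(opcodes, 4)

# In practice `raw` would come from open(path, "rb").read().
pixels = image_features(b".text:00401000", 16)
```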

Data Splitting

The dataset is split randomly into three parts: train, cross validation and test, with 64%, 16% and 20% of the data respectively.
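
One way to get a 64/16/20 split is to hold out 20% for test and then 20% of the remainder for cross validation (0.8 × 0.8 = 0.64). The sketch below also stratifies by class. That is an assumption, suggested by the nearly identical class percentages in the three parts, and the helper `stratified_split` is written for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction, seed=0):
    """Split indices into (rest, held_out), holding out about `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rest, held_out = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * fraction)
        held_out.extend(indices[:cut])
        rest.extend(indices[cut:])
    return rest, held_out

# Made-up labels: 100 points in three classes.
labels = [1] * 50 + [2] * 30 + [3] * 20

# Step 1: hold out 20% of all points for test.
train_cv, test = stratified_split(labels, 0.20)

# Step 2: hold out 20% of the remainder for cross validation.
remaining_labels = [labels[i] for i in train_cv]
train_pos, cv_pos = stratified_split(remaining_labels, 0.20)
train = [train_cv[i] for i in train_pos]
cv = [train_cv[i] for i in cv_pos]
```

On this toy data the three parts contain 64, 16 and 20 points.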

Fig : class distribution in train data
Number of data points in class 3 : 1883 ( 27.074 %)
Number of data points in class 2 : 1586 ( 22.804 %)
Number of data points in class 1 : 986 ( 14.177 %)
Number of data points in class 8 : 786 ( 11.301 %)
Number of data points in class 9 : 648 ( 9.317 %)
Number of data points in class 6 : 481 ( 6.916 %)
Number of data points in class 4 : 304 ( 4.371 %)
Number of data points in class 7 : 254 ( 3.652 %)
Number of data points in class 5 : 27 ( 0.388 %)
Fig : class distribution in cross validation data
Number of data points in class 3 : 471 ( 27.085 %)
Number of data points in class 2 : 396 ( 22.772 %)
Number of data points in class 1 : 247 ( 14.204 %)
Number of data points in class 8 : 196 ( 11.271 %)
Number of data points in class 9 : 162 ( 9.316 %)
Number of data points in class 6 : 120 ( 6.901 %)
Number of data points in class 4 : 76 ( 4.37 %)
Number of data points in class 7 : 64 ( 3.68 %)
Number of data points in class 5 : 7 ( 0.403 %)
Fig : class distribution in test data
Number of data points in class 3 : 588 ( 27.047 %)
Number of data points in class 2 : 496 ( 22.815 %)
Number of data points in class 1 : 308 ( 14.167 %)
Number of data points in class 8 : 246 ( 11.316 %)
Number of data points in class 9 : 203 ( 9.338 %)
Number of data points in class 6 : 150 ( 6.9 %)
Number of data points in class 4 : 95 ( 4.37 %)
Number of data points in class 7 : 80 ( 3.68 %)
Number of data points in class 5 : 8 ( 0.368 %)

Tested Models

Using the byte file features, asm file features and advanced features above, I tried the following machine learning models:

Fig : Different Models with features

Results

Fig : Results

Source Code

Refer to the link below for the source code of this project:

References
