Introduction to x86 Assembly Language – Part I

This article introduces some of the key concepts of x86 assembly language. It focuses on how malware analysts use assembly to understand what a piece of malicious software does and how its author programmed it. Before going into more detail, it explains some general concepts and why assembly code is used for malware analysis.

The right level of abstraction

Computers are generally described in terms of the following six levels of abstraction. The list is “bottom-up”: the first levels are the least portable.

  1. Hardware: Also known as digital logic. It’s the only physical level. It consists of electrical circuits that implement complex combinations of logical operations such as XOR, AND, OR and NOT.
  2. Microcode: Also known as firmware. This code operates only on the exact circuitry for which it was designed. It contains microinstructions that carry out the higher-level machine code instructions, providing a way to interface with the hardware.
  3. Machine code: Machine code consists of opcodes, usually displayed as hexadecimal digits, that tell the processor what you want it to do. It is created when a program written in a high-level language is compiled.
  4. Low-level languages: A low-level language is a human-readable version of a computer architecture’s instruction set. Assembly is the most common low-level language. A disassembler is used to generate low-level language text, which consists of simple mnemonics such as mov and jmp.
  5. High-level languages: This is the level of language used by most programmers (including malware authors). It provides a strong abstraction from the machine level and makes it easy to use programming logic and flow-control mechanisms. High-level languages include C, C++, etc. A compiler is used to turn them into machine code (a process known as compilation).
  6. Interpreted languages: This is the top level. Programmers use interpreted languages and platforms such as C#, Perl, .NET and Java. The code is not compiled directly into machine code; it is translated into bytecode. Bytecode is executed within an interpreter, a program that translates bytecode into executable machine code on the fly at runtime (for example through Just-In-Time compilation). An interpreter provides an automatic level of abstraction compared to traditional compiled code, because it can handle errors and memory management on its own.
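To make the idea of bytecode more concrete, here is a minimal sketch in Python. Python is not one of the languages named above; it is used here only because its standard library ships a bytecode disassembler (the dis module). The function add is a made-up example, and the exact instruction names depend on the Python version.

```python
import dis

def add(a, b):
    """High-level source code: what the programmer writes."""
    return a + b

# The compiled bytecode is stored on the function's code object as raw bytes.
raw_bytecode = add.__code__.co_code

# dis.get_instructions() decodes those bytes into named instructions,
# much like a disassembler turns machine code into mnemonics.
opnames = [ins.opname for ins in dis.get_instructions(add)]

print(raw_bytecode.hex())  # raw bytes, not meant for humans
print(opnames)             # readable instruction names (version dependent)
```

The two printed lines show the same gap as the one between machine code and assembly: a run of hexadecimal digits on one side, and readable instruction names on the other.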

As you can imagine, when analyzing malware you will almost always have only the binary, at the machine code level, rather than the source code. This is why a malware analyst uses a disassembler to generate low-level code and start the analysis.
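As a rough illustration of what a disassembler does, the following Python sketch maps a few raw bytes to their mnemonics. It is a toy lookup table, not a real disassembler: the function name toy_disassemble and the sample byte string are made up for this example, and only five well-known 32-bit x86 encodings are covered.

```python
# Toy lookup table: a handful of 32-bit x86 encodings and their mnemonics.
# A real disassembler also decodes prefixes, ModR/M bytes, immediates, etc.
OPCODES = {
    b"\x55": "push ebp",
    b"\x89\xe5": "mov ebp, esp",
    b"\x90": "nop",
    b"\x5d": "pop ebp",
    b"\xc3": "ret",
}

def toy_disassemble(code: bytes) -> list[str]:
    """Translate known byte sequences into mnemonics (hypothetical helper)."""
    lines = []
    offset = 0
    while offset < len(code):
        # Try the longest known encoding first (2 bytes, then 1 byte).
        for size in (2, 1):
            chunk = code[offset:offset + size]
            if chunk in OPCODES:
                lines.append(OPCODES[chunk])
                offset += len(chunk)
                break
        else:
            # Unknown byte: show it as raw data and move on.
            lines.append(f"db 0x{code[offset]:02x}")
            offset += 1
    return lines

# Sample machine code written as hexadecimal digits (made up for this example).
machine_code = bytes.fromhex("5589e5905dc3")
for line in toy_disassemble(machine_code):
    print(line)
```

The input is what the analyst starts with (hexadecimal digits), and the output is the kind of low-level text, one mnemonic per line, that the analysis is actually performed on.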

Here is an example, taken from the excellent and must-read book Practical Malware Analysis by Michael Sikorski and Andrew Honig (http://nostarch.com/malware):

[Figure: Example of Code Level]

Computer Architecture – Von Neumann

John von Neumann was a mathematician who described in 1945 what is still used as the computer architecture of most modern computers. In a publication named First Draft of a Report on the EDVAC, he described the following (quoted from the Wikipedia page on the Von Neumann architecture):

[…] a design architecture for an electronic digital computer with subdivisions of a processing unit consisting of an arithmetic logic unit and processor registers, a control unit containing an instruction register and program counter, a memory to store both data and instructions, external mass storage, and input and output mechanisms.

So why is this important for us? Well, apart from it being interesting that a publication from 1945 is the basis of computers in 2012, it helps us understand how the main parts of a computer work together, in particular the memory and the processor.

We have a control unit, which gets the instructions to execute from RAM using a register (the instruction pointer) that stores the address of the next instruction to execute. A register is the CPU’s basic data storage unit. There are several different registers, and they are often used to save time so that the CPU doesn’t need to access RAM. The ALU (arithmetic logic unit) executes an instruction fetched from RAM and places the results in registers or memory. When a program is running, the CPU fetches and executes one instruction after another.
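The fetch-and-execute cycle described above can be sketched with a few lines of Python. This is only a toy model under stated assumptions: the instruction names (LOAD, ADD, STORE, HALT), the register names ip and acc, and the memory layout are invented for illustration and are not x86. What it does show is the key Von Neumann idea: instructions and data live in the same memory.

```python
def run(memory):
    """Toy fetch-execute loop over a shared instruction/data memory."""
    registers = {"ip": 0, "acc": 0}  # instruction pointer and accumulator
    while True:
        # Fetch: read the instruction stored at the address held in ip.
        opcode, operand = memory[registers["ip"]]
        registers["ip"] += 1  # point at the next instruction

        # Execute: the "ALU" works on registers and memory.
        if opcode == "LOAD":
            registers["acc"] = memory[operand]
        elif opcode == "ADD":
            registers["acc"] += memory[operand]
        elif opcode == "STORE":
            memory[operand] = registers["acc"]
        elif opcode == "HALT":
            return registers
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Addresses 0-3 hold instructions, addresses 4-6 hold data.
memory = [
    ("LOAD", 4),   # 0: acc = memory[4]
    ("ADD", 5),    # 1: acc = acc + memory[5]
    ("STORE", 6),  # 2: memory[6] = acc
    ("HALT", 0),   # 3: stop
    2,             # 4: data
    40,            # 5: data
    0,             # 6: result is written here
]

final_registers = run(memory)
print(memory[6])              # 42
print(final_registers["ip"])  # 4
```

Note how the intermediate value stays in the acc register and RAM is touched only when a value is loaded or stored, which is the time-saving role of registers mentioned above.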

Sounds easy, doesn’t it? Well, you will see that it’s not that easy, but as soon as you understand where data is stored in memory, what kinds of instructions are used, how to read them, and so on, it becomes easier to understand the flow of a program (malicious or not) and how everything is linked together.

In the next article, I’ll explain in more detail the RAM sections and the key instructions.
