Introduction to x86 Assembly Language – Part II

After the (very) high-level introduction in Part I, we are going to go a little deeper into the subject. Let’s start by taking a closer look at memory (RAM) and the related registers (general-purpose registers, segment registers, EFLAGS and EIP). The goal is not to give a complete overview of how RAM works; keep in mind the overall objective: malware analysis. Therefore we won’t review everything, and we will limit the scope to the x86 (32-bit) architecture, as it’s the most common one.

RAM

The memory of a running program can be divided into four main sections:

  • Data section: Contains values that are put in place when a program is initially loaded. These values are sometimes called static values because they may not change while the program is running, or they may be called global values because they are available to any part of the program.
  • Code section: Includes the instructions fetched by the CPU to execute the program’s tasks. The code controls what the program does and how the program’s tasks will be orchestrated.
  • Heap section: The heap is used for dynamic memory during program execution, to create (allocate) new values and eliminate (free) values that the program no longer needs. The heap is referred to as dynamic memory because its contents can change frequently while the program is running.
  • Stack section: The stack is used for local variables and parameters for functions, and to help control program flow.

Yes, your guess is correct! The stack is the section normally targeted when exploiting a buffer overflow, and the heap can be abused as well, for example in a heap-spray attack. The sections have no fixed order, so there is no guarantee that the stack sits lower in memory than the heap, and so on.

Registers

A register is a small amount of data storage located on the CPU. Because it is “in” the CPU, access is very fast. Intel x86 processors have several kinds of registers:

  • General-purpose registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:
    • Operands for logical and arithmetic operations;
    • Operands for address calculations;
    • Memory pointers.
  • The segment registers (CS, DS, SS, ES, FS, and GS) hold 16-bit segment selectors. A segment selector is a special pointer that identifies a segment in memory. To access a particular segment in memory, the segment selector for that segment must be present in the appropriate segment register.
  • The 32-bit EFLAGS register contains a group of status flags, a control flag, and a group of system flags.
  • The instruction pointer (EIP) register contains the offset in the current code segment for the next instruction to be executed.

Some more details about general-purpose registers

Alternate General-Purpose Register Names

This deserves some more explanation. All the above registers are 32 bits in size, and their lower halves can also be referenced as 16-bit registers in assembly code. So EAX is the full 32-bit register, and AX is the “lower” 16 bits of EAX. Four of them (EAX, EBX, ECX and EDX) can additionally be referenced by their 8-bit parts: AL is the “lowest” 8 bits of EAX and AH is the second set of 8 bits (bits 8 to 15).
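To make the aliasing concrete, here is a minimal Python sketch that models it with bit masks. It is only an illustration: the function name sub_registers and the sample value 0x12345678 are made up for this example, and real code would of course access the registers directly in assembly.

```python
def sub_registers(eax: int) -> dict:
    """Split a 32-bit EAX value into its AX, AH and AL views."""
    eax &= 0xFFFFFFFF        # keep only the 32 bits of the full register
    ax = eax & 0xFFFF        # AX: lower 16 bits of EAX
    ah = (ax >> 8) & 0xFF    # AH: bits 8 to 15
    al = ax & 0xFF           # AL: bits 0 to 7
    return {"EAX": eax, "AX": ax, "AH": ah, "AL": al}


# Hypothetical sample value, chosen so every byte is easy to recognise.
for name, value in sub_registers(0x12345678).items():
    print(f"{name} = {value:#x}")
```

With EAX holding 0x12345678, AX is 0x5678, AH is 0x56 and AL is 0x78. The upper 16 bits (0x1234) have no name of their own.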

So what’s the purpose of all those registers?

  • EAX — Accumulator for operands and results data;
  • EBX — Pointer to data in the DS segment;
  • ECX — Counter for string and loop operations;
  • EDX — I/O pointer;
  • ESI — Pointer to data in the segment pointed to by the DS register; source pointer for string operations;
  • EDI — Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations;
  • ESP — Stack pointer (in the SS segment). This register should not be used for another purpose!
  • EBP — Pointer to data on the stack (in the SS segment).

Some more details about the EFLAGS register

During execution, each flag is either set (1) or cleared (0) to control CPU operations or to indicate the result of a CPU operation. The figure below gives more details.

EFLAGS Register

Well, that sounds like a lot of details to remember, so let’s reduce the scope a little. This is where the “malware analysis” focus comes in: let’s concentrate on the following flags:

  • Carry Flag (CF – bit 0): Set when an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result, that is, when the unsigned result is too large or too small for the destination operand; cleared otherwise.
  • Zero Flag (ZF – bit 6): Set if the result is zero; cleared otherwise.
  • Sign Flag (SF – bit 7): Set equal to the most-significant bit of the result, which is the sign bit of a signed integer: 0 indicates a positive value and 1 indicates a negative value.
  • Trap Flag (TF – bit 8): Set to enable single-step mode for debugging; clear to disable single-step mode.

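So CF, ZF and SF (and others not mentioned above) are used to indicate the results of arithmetic and comparison instructions such as ADD, SUB and CMP. Note that not every arithmetic instruction updates every flag: DIV, for example, leaves these flags undefined, and MUL leaves ZF and SF undefined.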
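As an illustration of how these three flags follow from a result, here is a small Python model of a 32-bit ADD and SUB. It is a simplified sketch, not a CPU emulator: the helper names add32 and sub32 are invented for this example, and the other status flags (OF, AF, PF) are deliberately ignored.

```python
MASK32 = 0xFFFFFFFF


def _flags(result: int, carry: bool) -> dict:
    """Derive CF, ZF and SF from a 32-bit result."""
    return {
        "CF": int(carry),          # carry or borrow out of bit 31
        "ZF": int(result == 0),    # result is zero
        "SF": (result >> 31) & 1,  # copy of the most-significant bit
    }


def add32(a: int, b: int):
    """Model a 32-bit ADD: return (result, flags)."""
    full = (a & MASK32) + (b & MASK32)
    return full & MASK32, _flags(full & MASK32, full > MASK32)


def sub32(a: int, b: int):
    """Model a 32-bit SUB: return (result, flags)."""
    a, b = a & MASK32, b & MASK32
    result = (a - b) & MASK32
    return result, _flags(result, a < b)  # borrow when a < b (unsigned)


print(add32(0xFFFFFFFF, 1))  # wraps around to zero
print(sub32(3, 5))           # borrows, result looks negative
```

Adding 1 to 0xFFFFFFFF wraps to 0 and sets both CF and ZF. Subtracting 5 from 3 gives 0xFFFFFFFE with CF and SF set, and subtracting two equal values sets ZF, which is the kind of outcome a conditional jump tests after a comparison.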

EIP: the Instruction Pointer

EIP, also known as the instruction pointer, is the register that contains the memory address of the next instruction to be executed by a program. EIP’s only purpose is to tell the processor what to do next.

The EIP register cannot be accessed directly by software; it is controlled implicitly by control-transfer instructions (such as JMP, Jcc, CALL, and RET), interrupts, and exceptions. The only way to read the EIP register is to execute a CALL instruction and then read the value of the return instruction pointer from the procedure stack. However, the EIP can be corrupted!

This is exactly what happens when a buffer-overflow vulnerability is exploited. If you manage to take control of EIP, you control the next instruction to be executed by the processor. You can therefore, and this is what attackers do, redirect EIP to malicious code that is already stored in memory. Then bingo, you can do (more or less) what you want!
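The idea can be sketched with a toy model in Python. This is a deliberately simplified picture and not a real exploit: the frame layout (an 8-byte buffer immediately followed by a 4-byte return address), the addresses and all the names below are assumptions made for the illustration. A real stack frame usually holds other data, such as the saved EBP, between the buffer and the return address.

```python
import struct

BUFFER_SIZE = 8            # hypothetical size of a local buffer
LEGIT_RETURN = 0x00401000  # hypothetical address pushed by CALL


def make_frame() -> bytearray:
    """Toy frame: the local buffer followed by the saved return address."""
    return bytearray(BUFFER_SIZE) + bytearray(struct.pack("<I", LEGIT_RETURN))


def unsafe_copy(frame: bytearray, data: bytes) -> None:
    """Copy data into the buffer with no length check (this is the bug)."""
    frame[0:len(data)] = data


def return_address(frame: bytearray) -> int:
    """Value that RET would load into EIP (little-endian, as on x86)."""
    return struct.unpack_from("<I", frame, BUFFER_SIZE)[0]


frame = make_frame()
unsafe_copy(frame, b"hi")  # fits in the buffer
print(hex(return_address(frame)))

# 8 filler bytes, then 4 bytes that land on top of the return address.
payload = b"A" * BUFFER_SIZE + struct.pack("<I", 0xDEADBEEF)
unsafe_copy(frame, payload)
print(hex(return_address(frame)))
```

The short input leaves the return address at 0x401000, while the oversized input replaces it with 0xdeadbeef. In the real case, that overwritten value is what RET loads into EIP.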

Assembly Syntax: AT&T and Intel

There are actually two main syntaxes: AT&T and Intel. As you can easily imagine, there are some differences, just to make our life… more confusing. The AT&T syntax is characterised by the following:

  • the use of the % symbol to prefix all register names;
  • the use of $ as a prefix for literal constants (i.e. immediate operands);
  • the operand ordering: the source operand appears on the left and the destination operand appears on the right.

The GNU Assembler (GAS) and many other GNU tools, including gcc and gdb, utilise AT&T syntax.

Adding four to the EAX register with AT&T syntax looks like:

ADD $0x4, %EAX

The Intel syntax doesn’t use prefixes and the operand order is reversed (the destination operand appears on the left and the source operand on the right):

ADD EAX, 0x4 

The Intel syntax is used by the Microsoft Assembler (MASM), Borland’s Turbo Assembler (TASM), and the Netwide Assembler (NASM).

Note: Unless stated otherwise, we will use the Intel syntax throughout all posts on this website.
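The two differences described above (prefixes and operand order) can be sketched as a tiny Python converter. This is only a toy under strong assumptions: it handles plain register and immediate operands, and it ignores the rest of the AT&T syntax, such as memory operands and size suffixes on mnemonics. The function name att_to_intel is made up for this example.

```python
def att_to_intel(line: str) -> str:
    """Convert a simple AT&T instruction (registers/immediates only) to Intel."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    # Drop the % (register) and $ (immediate) prefixes.
    cleaned = [op.lstrip("%$") for op in operands]
    # AT&T is "source, destination"; Intel is "destination, source".
    cleaned.reverse()
    return f"{mnemonic} {', '.join(cleaned)}".strip()


print(att_to_intel("ADD $0x4,%EAX"))
```

Applied to the AT&T example above, this prints ADD EAX, 0x4, which is the Intel form of the same instruction.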

Assembly Instructions

An instruction is usually made up of the following:

  • One mnemonic: a word that identifies the instruction to execute.
  • Zero or more operands: they identify the information the instruction is going to use (e.g. a register, data, etc.).

If you refer to the above example (adding four to EAX):

  • Mnemonic: ADD
  • Destination operand: EAX
  • Source operand: 0x4
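The same breakdown can be expressed as a few lines of Python that split an Intel-syntax line into its mnemonic and operands. It is a rough sketch that assumes a simple "MNEMONIC op1, op2" layout; the Instruction type and the parse_instruction function are hypothetical names used only here.

```python
from typing import List, NamedTuple


class Instruction(NamedTuple):
    mnemonic: str
    operands: List[str]


def parse_instruction(line: str) -> Instruction:
    """Split 'MNEMONIC op1, op2' into a mnemonic and a list of operands."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    return Instruction(mnemonic.upper(), operands)


ins = parse_instruction("ADD EAX, 0x4")
print(ins.mnemonic)     # the operation to perform
print(ins.operands[0])  # destination operand (Intel order)
print(ins.operands[1])  # source operand
```

An instruction without operands, such as RET, simply produces an empty operand list.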

You can find a whole list of instructions here: http://en.wikipedia.org/wiki/X86_instruction_set

Conclusion

That’s a lot to read and to understand, but it gives you some additional reference points for reading assembly code. The last part will focus on typical instructions and how to map them to common programming constructs (if, for loop, while, etc.).

Source:

  • Intel® 64 and IA-32 Architectures Software Developer’s Manual
  • The Art of Assembly Language, Second Edition, Randall Hyde (March 2010)
  • Practical Malware Analysis, Michael Sikorski and Andrew Honig (2012)
  • Malware Analyst’s Cookbook, Michael Hale Ligh, Steven Adair, Blake Hartstein and Matthew Richard (2011)
  • Wikipedia
  • Google is your best friend…
