January 11, 2013

So I Wrote A Disassembler

It was the middle of the Texas summer, and I had just returned to the USA from India with absolutely no work to do, since I had not registered for any summer courses. Back in India, my mentor had suggested that I try my hand at writing a disassembler as a first step into the realm of low-level system code, OS internals and system software. So that's exactly what I did. I wrote a disassembler for 32-bit Windows PE files with support for integer, floating-point, MMX, SSE1 and AES instructions :D

Let me tell you, it was not an easy start. I had no clue where to even begin, so I started reading up on the format of PE files. I found a very good resource online for this: Matt Pietrek's explanation of the PE file format (MSDN Magazine, February 2002). I got stuck at one point trying to work out where the code section actually begins in the binary. This article helped a lot in clearing up my doubts. That link also has some amazing work on x86 assembly and reverse engineering. One must also know about memory-mapped files, which come in very handy when you need to read data at arbitrary offsets in a file rather than sequentially. Memory mapping is done with the help of the OS: it maps the file's contents from disk into virtual memory pages of the process that requests the mapping. The process can then access the file contents through a pointer, just the way it accesses any other memory location.
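To make the two ideas above concrete, here is a minimal Python sketch (not the code from my disassembler) that memory-maps a file and then walks the PE headers to find the section containing the entry point, which is usually where the code lives. The function names `map_file` and `find_code_section` are made up for this example; the header offsets are the standard ones from the PE/COFF layout.

```python
import mmap
import struct


def map_file(path):
    """Map a whole file read-only; the result can be indexed like bytes."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def find_code_section(image):
    """Return (name, file_offset, raw_size) of the section holding the entry point.

    `image` is any bytes-like object, for example the mapping from map_file().
    """
    if bytes(image[:2]) != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C in the DOS header) points at the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if bytes(image[pe_offset:pe_offset + 4]) != b"PE\0\0":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4                      # 20-byte COFF file header
    (num_sections,) = struct.unpack_from("<H", image, coff + 2)
    (optional_size,) = struct.unpack_from("<H", image, coff + 16)
    optional = coff + 20                      # optional header follows
    (entry_rva,) = struct.unpack_from("<I", image, optional + 16)
    section_table = optional + optional_size  # 40-byte section headers
    for i in range(num_sections):
        hdr = section_table + 40 * i
        name = bytes(image[hdr:hdr + 8]).rstrip(b"\0").decode("ascii", "replace")
        virt_size, virt_addr, raw_size, raw_ptr = struct.unpack_from(
            "<4I", image, hdr + 8)
        # The entry point RVA falls inside exactly one section's address range.
        if virt_addr <= entry_rva < virt_addr + max(virt_size, raw_size):
            return name, raw_ptr, raw_size
    raise ValueError("entry point is not inside any section")
```

With a real binary this would be used roughly as `find_code_section(map_file("sample.exe"))`, where `sample.exe` is a placeholder path. The returned file offset is where disassembly of that section would start.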

I then started reading the Intel Software Developer's Manuals, which give a huge amount of information on the processor architecture and, more importantly, the Instruction Set Architecture (ISA). Volume 2 describes the instruction format used by Intel processors and the detailed encoding of each and every assembly instruction. Since instructions in the x86 ISA have variable encoding lengths, it takes some effort to achieve a successful disassembly. When going through the ISA, one might think that the instructions have random opcodes without any particular pattern to them, but that is far from the truth. Take a look at this document which graphically depicts the opcodes of all instructions. Even though the instructions are encoded in different sizes, they can be disassembled systematically using the general instruction format given below.

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 | [Prefix] |  Opcode  | [ModR/M] |  [SIB]  |   [Disp]   |   [Imm]    |
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 |  up to 4 |  1/2/3   |   0/1    |   0/1   |  0/1/2/4   |  0/1/2/4   |
 |  1 byte  |  bytes   |   byte   |  byte   |   bytes    |   bytes    |
 |   each   |          |          |         |            |            |
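
As an illustration of how the optional fields in this format depend on each other, here is a small Python sketch (not taken from my disassembler) that splits a ModR/M byte into its mod, reg and r/m fields and works out how many bytes the ModR/M, SIB and displacement occupy together. It assumes 32-bit addressing, that is, no address-size prefix; the function names are made up for this example.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def modrm_tail_length(code, pos):
    """Bytes taken by ModR/M + optional SIB + optional displacement.

    `code[pos]` must be the ModR/M byte. Assumes 32-bit addressing.
    """
    mod, _reg, rm = split_modrm(code[pos])
    length = 1                          # the ModR/M byte itself
    if mod == 0b11:                     # register operand: nothing follows
        return length
    sib_base_is_disp32 = False
    if rm == 0b100:                     # r/m = 100 means a SIB byte follows
        length += 1
        sib_base = code[pos + 1] & 0b111
        # With mod = 00, SIB base = 101 means "no base register, disp32".
        sib_base_is_disp32 = (mod == 0b00 and sib_base == 0b101)
    if mod == 0b01:
        length += 1                     # 8-bit displacement
    elif mod == 0b10:
        length += 4                     # 32-bit displacement
    elif rm == 0b101 or sib_base_is_disp32:
        length += 4                     # mod = 00 special cases: disp32
    return length
```

For example, in `8B 45 08` (`mov eax, [ebp+8]`) the ModR/M byte is `45`, so mod is 01, r/m is 101, and one displacement byte follows.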

Using the above format, we can start processing at the very first byte of the code section by checking whether it is a prefix or an opcode and moving on from there. The state machine below gives a more accurate picture of the disassembly process. Links to the source code and binaries of my disassembler are posted in the Code Section. Look at the code and you will get a better understanding of the whole thing.
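
A heavily simplified Python sketch of the first two states (prefix, then opcode) might look like the following. This is only an illustration, not my actual engine: the state names and the function `read_prefixes_and_opcode` are invented for this example, it only knows the legacy prefixes and the `0F`, `0F 38` and `0F 3A` escape bytes, and it stops before ModR/M handling.

```python
# Legacy prefix bytes: lock/rep, segment overrides, operand- and address-size.
LEGACY_PREFIXES = frozenset(
    {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x66, 0x67})


def read_prefixes_and_opcode(code, pos=0):
    """Return (prefixes, opcode_bytes, next_pos) for the instruction at pos."""
    state = "PREFIX"
    prefixes = []
    opcode = []
    while state != "DONE":
        if pos >= len(code):
            # Corresponds to an error state: the bytes ended mid-instruction.
            raise ValueError("ran out of bytes inside an instruction")
        byte = code[pos]
        if state == "PREFIX":
            if byte in LEGACY_PREFIXES:
                prefixes.append(byte)
                pos += 1
            else:
                state = "OPCODE"        # re-examine this byte as an opcode
        elif state == "OPCODE":
            opcode.append(byte)
            pos += 1
            # 0F escapes to the two-byte map; 0F 38 and 0F 3A to three-byte maps.
            if opcode == [0x0F] or opcode in ([0x0F, 0x38], [0x0F, 0x3A]):
                continue
            state = "DONE"
    return prefixes, bytes(opcode), pos
```

For the bytes `66 0F 38 DC C1` (an AES `aesenc` encoding) this collects the prefix `66`, the three opcode bytes `0F 38 DC`, and leaves the position at the ModR/M byte `C1`.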

The ERROR state may be entered from any other state because of disassembly errors, such as starting disassembly in the middle of an instruction or hitting an instruction that is not supported yet. If you look at the source code, there are about 9k LOC (with comments) for just the disassembly engine. This may sound like a lot, but it is only because of the number of different instructions. The logic is the same for processing all instructions (the opcode handlers): the only thing to do is determine which state to go to next by looking at the information gathered so far. There are a lot of if-else conditions to be checked, because what the next byte means depends on what has been processed until now. The main part of disassembly, the opcode processing, is easily done using jump tables. I have used an array of function pointers that stores the addresses of the opcode handler functions, so I can call the appropriate function with a simple table lookup when the opcode byte is read.
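
The jump-table idea can be sketched in Python with a 256-entry list of functions standing in for the array of function pointers. This is a toy illustration, not my engine: the handler names are invented, and only a handful of one-byte instructions with no operand bytes are covered.

```python
REG32 = ("eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi")


def op_unsupported(code, pos):
    # Stands in for the ERROR state: no handler is registered for this opcode.
    raise ValueError(f"unsupported opcode {code[pos]:#04x} at offset {pos}")


def op_nop(code, pos):
    return "nop", pos + 1


def op_ret(code, pos):
    return "ret", pos + 1


def op_push_reg(code, pos):
    # Opcodes 50..57: the low three bits select the register.
    return f"push {REG32[code[pos] & 0b111]}", pos + 1


def op_pop_reg(code, pos):
    # Opcodes 58..5F: the low three bits select the register.
    return f"pop {REG32[code[pos] & 0b111]}", pos + 1


# One slot per possible opcode byte; unknown opcodes fall through to the error.
HANDLERS = [op_unsupported] * 256
HANDLERS[0x90] = op_nop
HANDLERS[0xC3] = op_ret
for opcode in range(0x50, 0x58):
    HANDLERS[opcode] = op_push_reg
for opcode in range(0x58, 0x60):
    HANDLERS[opcode] = op_pop_reg


def disassemble(code):
    """Decode a byte string by dispatching on each opcode byte."""
    pos, lines = 0, []
    while pos < len(code):
        text, pos = HANDLERS[code[pos]](code, pos)  # simple table lookup
        lines.append(text)
    return lines
```

Calling `disassemble(bytes([0x55, 0x90, 0x5D, 0xC3]))` gives `["push ebp", "nop", "pop ebp", "ret"]`. Supporting a new instruction then comes down to writing one more handler and registering it in the table, which is why the engine grows with the number of instructions rather than in complexity.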

It was a completely absorbing experience to work on this project: about one and a half months of summer went by in a flash of frantic coding! The very first time I tested the code and saw a small snippet of disassembler output in the geeky green console window, I was elated and eager to do more and complete the project. There is no better way than coding to kill time!