Recognizing code

At the highest levels of abstraction, computer science is pure math. One can write proofs about the behavior of a program, including proofs of correctness and of algorithmic complexity. At the lowest levels of computer engineering, we see physics applied to understand and plan the behavior of silicon at the quantum level. What’s in the middle? At many levels, we see more art than science, and this is particularly true of assembly language programming. Over time, programmers who delve into this abyss learn certain “tricks of the trade” that are difficult to teach outside of experience. One of the skills I’ve learned is “recognizing code”: seeing some hex bytes and recognizing them as code, data, or garbage.

Some of the tricks are easy. For instance, if you ever see “CC CC CC …” you can be nearly certain that you are looking at code. Why? 0xCC is a single-byte x86 opcode representing “int 3”, the “breakpoint interrupt”. This one-byte encoding is very useful when setting “soft breakpoints” (as opposed to hardware breakpoints), where a single byte of an instruction stream is changed to CC to insert a breakpoint.
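As a rough sketch of this first trick, here is a small Python scan for runs of 0xCC in a byte buffer. The function name `find_int3_runs` and the `min_len` threshold are my own choices for illustration, not anything a debugger provides, and a hit is only a hint that you are near code, not proof.

```python
def find_int3_runs(data: bytes, min_len: int = 3):
    """Return (offset, length) for every run of 0xCC bytes of at least min_len."""
    runs = []
    i = 0
    n = len(data)
    while i < n:
        if data[i] == 0xCC:
            start = i
            # Consume the whole run of int 3 bytes.
            while i < n and data[i] == 0xCC:
                i += 1
            if i - start >= min_len:
                runs.append((start, i - start))
        else:
            i += 1
    return runs
```

For example, `find_int3_runs(bytes.fromhex("c3cccccccc4883ec28"))` reports one run of four bytes starting at offset 1, right after the `c3` (ret).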

Sometimes it requires a bit of knowledge of the instruction encodings to make inferences. Most x86 instructions (both 32-bit and 64-bit) default to 32-bit operand sizes. In 64-bit code, the “REX” prefixes can change this to 64-bit operands and also extend the number of registers available. What the REX prefixes have in common is that they all fall in the range 0x40-0x4F. Here’s a random snippet of ntdll, which I grabbed with WinDbg:

0:000> db .
00000000`778e2e85 90 e9 af ef fd ff 66 44-39 2f 0f 84 bd ef fd ff ......fD9/......
00000000`778e2e95 48 8d 94 24 90 00 00 00-4c 8b c7 48 8b cf 4d 89 H..$....L..H..M.

We see 0x4* show up a number of times in this sequence, and sure enough, every single one of these is a REX byte:

00000000`778e2e85 90 nop
00000000`778e2e86 e9afeffdff jmp ntdll!LdrQueryImageFileKeyOption+0x4da
00000000`778e2e8b 6644392f cmp word ptr [rdi],r13w
00000000`778e2e8f 0f84bdeffdff je ntdll!LdrQueryImageFileKeyOption+0x4f2
00000000`778e2e95 488d942490000000 lea rdx,[rsp+90h]
00000000`778e2e9d 4c8bc7 mov r8,rdi
00000000`778e2ea0 488bcf mov rcx,rdi

This trick only works on 64-bit code, however, since in 32-bit code those same bytes are the single-byte versions of the “INC” (0x40-0x47) and “DEC” (0x48-0x4F) instructions.
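To make the REX layout concrete: the high nibble is always 0100, and the low four bits are the W, R, X, and B flags. The sketch below splits a byte into those flags. The name `decode_rex` is just for illustration, and the function only inspects one byte; it cannot tell whether that byte really sits in prefix position (immediately before the opcode, after any legacy prefix such as 66) or is part of an operand.

```python
def decode_rex(byte: int):
    """Split a candidate REX prefix (0x40-0x4F) into its four flag bits.

    Returns None if the byte is outside the REX range.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": bool(byte & 0x08),  # 64-bit operand size
        "R": bool(byte & 0x04),  # extends the ModRM reg field (r8-r15)
        "X": bool(byte & 0x02),  # extends the SIB index field
        "B": bool(byte & 0x01),  # extends ModRM r/m, SIB base, or opcode reg
    }
```

Applied to the dump above, 0x48 sets only W (the `lea rdx` and `mov rcx,rdi`), 0x4C sets W and R (`mov r8,rdi`), 0x4D sets W, R, and B (the store of r13 through r15), and 0x44 sets only R (the 16-bit compare against r13w, where the preceding 66 selects the operand size).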

Looking at code bytes is well and good, but usually we have a disassembler handy to look at the actual instructions. Here’s a longer section of ntdll:

00000000`778e2e86 e9afeffdff jmp ntdll!LdrQueryImageFileKeyOption+0x4da
00000000`778e2e8b 6644392f cmp word ptr [rdi],r13w
00000000`778e2e8f 0f84bdeffdff je ntdll!LdrQueryImageFileKeyOption+0x4f2
00000000`778e2e95 488d942490000000 lea rdx,[rsp+90h]
00000000`778e2e9d 4c8bc7 mov r8,rdi
00000000`778e2ea0 488bcf mov rcx,rdi
00000000`778e2ea3 4d89afe0020000 mov qword ptr [r15+2E0h],r13
00000000`778e2eaa e8d1770900 call ntdll!EtwDeliverDataBlock+0x220
00000000`778e2eaf 90 nop
00000000`778e2eb0 e99deffdff jmp ntdll!LdrQueryImageFileKeyOption+0x4f2
00000000`778e2eb5 8b0dd57a0d00 mov ecx,dword ptr [ntdll!LdrSystemDllInitBlock+0xb0]
00000000`778e2ebb f6c103 test cl,3
00000000`778e2ebe 7431 je ntdll!RtlIsDosDeviceName_U+0x94d1

We can see a number of signs that immediately tell us that this is real code. First, we see several comparisons followed immediately by a conditional jump. We also see a call preceded by instructions that set rcx, r8, and rdx; in the Windows x64 calling convention, the first four integer parameters are passed in rcx, rdx, r8, and r9. Finally, we see a number of jumps and calls to locations within the same module, which show up as module!export+offset with a relatively small offset. This last sign can be misleading at times, however, since a jump with an 8-bit relative offset will almost always land within the same module, and might look valid even if it isn’t.
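The “jumps land in the same module” check can be done by hand from the raw bytes. A rel32 branch target is the address of the next instruction plus the signed 32-bit displacement stored in the last four bytes. The sketch below assumes the instruction is already known to be a jmp, call, or jcc with a 32-bit displacement; `rel32_target` and `in_module` are hypothetical helper names, and the module base and size would have to come from your debugger.

```python
import struct

def rel32_target(address: int, instruction: bytes) -> int:
    """Target of a jmp/call/jcc rel32 instruction located at `address`."""
    # The displacement is the last four bytes, little-endian and signed.
    (disp,) = struct.unpack("<i", instruction[-4:])
    # It is relative to the address of the *next* instruction.
    return (address + len(instruction) + disp) & 0xFFFFFFFFFFFFFFFF

def in_module(target: int, base: int, size: int) -> bool:
    """True if `target` falls inside the module's [base, base + size) range."""
    return base <= target < base + size

jmp = rel32_target(0x778E2E86, bytes.fromhex("e9afeffdff"))
je = rel32_target(0x778E2E8F, bytes.fromhex("0f84bdeffdff"))
print(hex(jmp), hex(je), hex(je - jmp))
```

The two targets come out 0x18 bytes apart, which agrees with the +0x4da and +0x4f2 offsets the debugger printed for the same jmp and je.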

Being able to recognize code can be a useful skill. When debugging without source code, it’s often useful to be able to tell whether a destination address (reached through an indirect call, for instance) points to real code. When writing an emulator or disassembler, it’s useful to see whether the instruction boundaries have been determined correctly. When looking at a crash, it’s useful to see whether real code is executing, or whether the instruction pointer is somehow pointing into a data section (through stack corruption or a bad indirect jump). There are often other ways to determine whether a piece of memory is actually code, but there are times when these methods fail, such as with dynamically generated code (for instance, from a JIT compiler or from a buffer overflow attack). I’ve even seen bugs that we were able to solve quickly by realizing that executable code had somehow been loaded into a register!
