2nd FASM Technical Discussion, Brno, 2007 August 25th

Debugging in Long Mode - AMD64

by Feryno


References:

24593.pdf from www.amd.com
www.kernel.org (Linux kernel)
man ptrace (Linux help)
msdn.microsoft.com
my own mistakes and the many years spent debugging because of them...


While debugging, we are playing with an executable program. We can stop it, change its memory and registers while it is stopped, step through it, and resume its execution.
The CPU executes code very quickly. During debugging, we can execute code at a speed observable by human senses (sight).
To play this game, we need another program - a debugger.


Why do programmers need debugging?
1. to find bugs (critical errors causing a program crash)
2. to find mistakes in procedures giving wrong, unexpected results
3. to improve procedures by stepping through their instructions and watching register/memory changes
4. to examine an unknown executable discovered in the system
(5. to learn what instructions do - suitable especially for beginners. I do it very often too - instead of reading manuals.)


Debugging is possible thanks to CPU features called "exceptions".


The first 32 interrupts (int00h-int1Fh) are reserved for exceptions. Exceptions behave very similarly to interrupts - every exception interrupts the program execution, and control is transferred from the currently executing program to the routine handling the exception. These routines are part of the OS kernel and are called "exception handlers". During the control transfer to the exception handler, the CPU stops execution of the program and saves its return instruction pointer (RIP), stack pointer (RSP) and flags register (RFLAGS). The handler is responsible for saving the remaining state of the interrupted program (GPR, XMM, ...). Saving this state allows the interrupted program to be restarted after the handler finishes handling the exception.


Most exceptions signal a "degenerate" instruction or piece of code in the program. In this case, the exception boundary is reported before the instruction causing the exception, and the interrupted instruction isn't allowed to complete. These exceptions are called "faults".
To make life more complicated, the reported instruction pointer sometimes lies at the address of the following instruction: the boundary is reported after the instruction causing the exception, and that instruction is allowed to complete. These exceptions are called "traps". The benefit of traps for our life is that they are the core of debugging.
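
The difference can be seen in the instruction pointer saved for the handler. A small illustration (the labels are made up for this example; ud2 is reported as a fault, int3 as a trap):
fault_here:
ud2		; int06 - saved RIP = fault_here, ud2 was not executed
int3		; int03 - saved RIP = after_trap, int3 was executed
after_trap: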


exception divide by zero
triggers int00 vector
samples:

mov rcx,0
div rcx ; divisor is 0

mov rdx,3
mov rax,0
mov rcx,2
div rcx ; result (quotient) is bigger than the capabilities of the destination register


exception single step
triggers int01 vector
samples:

icebp ; db 0F1h

pushfq
or qword [rsp],1 shl 8 ; set Trap Flag
popfq ; the core method of single stepping (in fact, OS sets this bit in program context and reloads registers when task switching)

lea rax,[trap_instruction]
mov dr0,rax
mov eax,1
mov dr7,rax ; the core method of hardware breakpoints
trap_instruction:

lea rax,[mem_write_addr]
mov dr0,rax
mov eax,10001h
mov dr7,rax
mem_write_addr db	?


exception breakpoint
triggers int03 vector
samples:

int3
db 0CCh
int 03h
db 0CDh, 03h


exception invalid opcode
triggers int06 vector
samples:

ud2
db 48h, 8Dh, 0C2h ; lea rax,rdx - invalid, the source operand is a register; the correct form is lea rax,[rdx]
(a lot of instructions illegal in long mode...)


double fault
triggers int08 vector


exception stack fault
triggers int0C vector


exception general protection
triggers int0D vector


exception page-fault
triggers int0E vector


exception alignment check
triggers int11 vector

; kernel
mov rax,cr0
or rax,1 shl 18
mov cr0,rax ; AM bit of CR0

; user mode program, stack aligned at qword or dqword
pushfq
or qword [rsp],1 shl 18 ; set AC bit of rflags
popfq
mov eax,[rsp+1]
; occurs only when CPL=3, never occurs when CPL < 3.


Interactions:

        OS
       /  \
program    debugger

1. Program causes an exception.

2. CPU stops the execution of the program, saves instruction pointer, stack pointer, flags of the program and control is given to the corresponding exception handler (= interrupt vector).

3. OS handles the interrupt vector and notifies the debugger about the exception.
Linux64:
mov eax,sys_wait4
syscall
Win64:
call qword [KERNEL32.WaitForDebugEvent]

4. User is allowed to change registers/memory of the program via debugger.
Linux64:
mov edi,PTRACE_GETREGS
mov eax,sys_ptrace
syscall
PTRACE_GETREGS, PTRACE_SETREGS,	PTRACE_PEEKTEXT, PTRACE_POKETEXT, PTRACE_PEEKDATA, PTRACE_POKEDATA
Win64:
call qword [KERNEL32.GetThreadContext]
GetThreadContext, SetThreadContext, ReadProcessMemory, WriteProcessMemory

5. User can resume execution of the program via debugger.
Linux64:
mov edi,PTRACE_CONT
mov eax,sys_ptrace
syscall
mov edi,PTRACE_SINGLESTEP
mov eax,sys_ptrace
syscall
Win64:
call qword [KERNEL32.ContinueDebugEvent]
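
On Linux64, steps 3-5 might be put together roughly like this. It is only a sketch: child_pid, status, regs and debugger_exit are hypothetical names, the child is assumed to be traced already, and the numbers are the x86-64 Linux syscall and ptrace request values. Syscall arguments go into rdi, rsi, rdx, r10.
sys_wait4	=	61
sys_ptrace	=	101
PTRACE_CONT	=	7
PTRACE_GETREGS	=	12
REGS_RIP	=	128		; offset of rip in user_regs_struct

debug_loop:
; step 3: wait4(pid, &status, 0, NULL) blocks until the program stops
mov edi,[child_pid]
lea rsi,[status]
xor edx,edx
xor r10d,r10d
mov eax,sys_wait4
syscall
test rax,rax
js debugger_exit		; error (hypothetical exit label)
mov eax,[status]
cmp al,7Fh			; low byte 7Fh = stopped, anything else = exited or killed
jne debugger_exit
; step 4: ptrace(PTRACE_GETREGS, pid, 0, &regs)
mov edi,PTRACE_GETREGS
mov esi,[child_pid]
xor edx,edx
lea r10,[regs]
mov eax,sys_ptrace
syscall
; here the debugger can show qword [regs+REGS_RIP], let the user edit regs
; and write them back the same way with PTRACE_SETREGS (=13)
; step 5: ptrace(PTRACE_CONT, pid, 0, 0), data=0 means no signal is passed to the program
mov edi,PTRACE_CONT
mov esi,[child_pid]
xor edx,edx
xor r10d,r10d
mov eax,sys_ptrace
syscall
jmp debug_loop

; data, to be placed in a writeable segment
child_pid	dd ?
status		dd ?
regs		rq 27			; user_regs_struct = 27 qwords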

If the program doesn't cause any exception, it runs to its end and terminates. In this case, the debugger doesn't encounter any exception and is only notified about program termination at the end. This is the dream of every assembly coder and the desirable final stage of developing any program... well, not exactly, some procedures may still behave incorrectly and give unexpected return values...


"Hardware" Breakpoint
triggers int01 vector
HW BP is done by setting some debug registers.
We need to focus only on debug registers 0, 1, 2, 3, 6, 7
DR4, DR5, DR8, DR9, DR10, DR11, DR12, DR13, DR14, DR15 aren't used. Isn't it a pity? But on the other side, it could be even more complicated...
The debug registers can be read and written only when the current-protection level (CPL) is 0 (most privileged) - kernel.
; CPL=0
mov rax,dr7
mov dr3,rcx

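A user-mode debugger running at CPL=3 can access the debug registers of a program when the program is stopped after causing an exception.
Linux64:
; the debug registers are not part of the PTRACE_GETREGS/PTRACE_SETREGS block,
; they are read and written one at a time in the u_debugreg array of struct user
mov edi,PTRACE_PEEKUSER
mov eax,sys_ptrace
syscall
mov edi,PTRACE_POKEUSER
mov eax,sys_ptrace
syscall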
Win64:
call qword [KERNEL32.GetThreadContext]
call qword [KERNEL32.SetThreadContext]
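
A sketch of how a Linux64 debugger might arm a 4-byte write watchpoint in the stopped program. On Linux the debug registers are reached through PTRACE_POKEUSER at an offset inside struct user. The names child_pid and watch_addr are hypothetical, and the offset 848 of u_debugreg is an assumption that should be checked against sys/user.h of the target system:
sys_ptrace	=	101
PTRACE_POKEUSER	=	6
U_DEBUGREG	=	848		; assumed offset of u_debugreg[0] in struct user

; ptrace(PTRACE_POKEUSER, pid, offset of DR0, address)
mov edi,PTRACE_POKEUSER
mov esi,[child_pid]
mov edx,U_DEBUGREG + 0*8
mov r10,[watch_addr]		; must be dword aligned for a 4-byte watch
mov eax,sys_ptrace
syscall
; ptrace(PTRACE_POKEUSER, pid, offset of DR7, conditions)
mov edi,PTRACE_POKEUSER
mov esi,[child_pid]
mov edx,U_DEBUGREG + 7*8
mov r10d,(1101b shl 16) + 1	; LEN0=11b (4 bytes), R/W0=01b (data write), L0=1
mov eax,sys_ptrace
syscall

child_pid	dd ?
watch_addr	dq ?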


DR0
DR1
DR2
DR3
64-bit registers hold virtual (linear) address.
lea rax,[address]
mov dr0,rax

If we need to set a debug register DR0-DR3, then we must also set its conditions in the DR7 register - enabled bit, type, length.

DR7
bit(s)	mnemonic	description
31-30	LEN3		Length of Breakpoint #3
29-28	R/W3		Type of Transaction to Trap
27-26	LEN2		Length of Breakpoint #2
25-24	R/W2		Type of Transaction to Trap
23-22	LEN1		Length of Breakpoint #1
21-20	R/W1		Type of Transaction to Trap
19-18	LEN0		Length of Breakpoint #0
17-16	R/W0		Type of Transaction to Trap
6	L3		Local Exact Breakpoint #3 Enabled
4	L2		Local Exact Breakpoint #2 Enabled
2	L1		Local Exact Breakpoint #1 Enabled
0	L0		Local Exact Breakpoint #0 Enabled

LEN0-LEN3
00b	1 byte
01b	2 byte, addr in corresp DR0-3 must be word aligned
10b	8 byte, address in DR must be qword aligned
11b	4 byte, address must be dword aligned

R/W0-R/W3
00b	int01 breakpoint on instruction execution, LEN must be 1 byte (LENx = 00b)
01b	int01 occurs only on data write
10b	int01 only on I/O read/write if CR4.DE=1 (bit 3. of CR4) - in, out, insb, outsb
	if CR4.DE=0 this setting is undefined
11b	int01 occurs only on data read or data write

We want to set DR0-DR3 register
DRx, x = 0, 1, 2, 3
lea rax,[address]
mov DRx,rax
mov eax,((length*4 + type) shl (x*4 + 16)) +  (1 shl (x*2))	; length, type = LENx, R/Wx encodings from the tables above
mov dr7,rax
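
The same formula can be wrapped into a FASM macro that also keeps the settings of the other three breakpoints. This is only a sketch: set_hw_bp and its parameter names are made up for this example, it must run at CPL=0, and it expects the address in rax.
macro set_hw_bp x, bp_len, bp_type
{
	mov dr#x,rax
	mov rax,dr7
	and eax,0FFFFFFFFh xor ((1111b shl (x*4 + 16)) + (11b shl (x*2)))	; clear LENx, R/Wx, Lx, Gx
	or eax,((bp_len*4 + bp_type) shl (x*4 + 16)) + (1 shl (x*2))	; set LENx, R/Wx, Lx
	mov dr7,rax
}

; usage: watch reading from or writing into 1 qword
mov rax,100005120h
set_hw_bp 0, 10b, 11b		; breakpoint 0, LEN=10b (8 bytes), R/W=11b (data read or write)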

; example 0
; we want to watch reading from or writing into 1 qword at address 100005120h (address range 100005120h-100005127h)
mov rax,100005120h	; the address doesn't fit into a 32-bit displacement, so lea can't encode it
mov dr0,rax
mov rax,dr7
and eax,not ((1111b shl 16) + 11b)	; mask off all
or eax,(1011b shl 16) + 1		; prepare to set what we want
mov dr7,rax				; set it finally
; Done, now we can wait until code falls into the trap ! After accessing any byte at 100005120h-100005127h, int01 will occur and DR6.B0 bit will be set to 1.

; example 1
; we want to watch writing into 8 bytes at address range 40AF31h-40AF38h
; setting length = 8 bytes doesn't work here (the address isn't aligned at a qword boundary)
; we must set more breakpoints to cover the whole address range
; breakpoint 0. to watch 1 byte at 40AF31h
; breakpoint 1. to watch 1 word at 40AF32h-40AF33h
; breakpoint 2. to watch 1 dword at 40AF34h-40AF37h
; breakpoint 3. to watch 1 byte at 40AF38h
mov rax,dr7
and eax,0000FF00h	; mask off all
lea rdx,[40AF31h]
mov dr0,rdx
or eax,(0001b shl 16) + 1
lea rdx,[40AF32h]
mov dr1,rdx
or eax,(0101b shl 20) + 100b
lea rdx,[40AF34h]
mov dr2,rdx
or eax,(1101b shl 24) + 10000b
lea rdx,[40AF38h]
mov dr3,rdx
or eax,(0001b shl 28) + 1000000b
mov dr7,rax

; example 2
; we want to break on the execution of the instruction at 401235h
; note: the instruction must start exactly at this address
; if the set address lies somewhere inside the instruction (instruction has 2 or more bytes) then int01 won't occur !!!
lea rax,[401235h]
mov dr0,rax
mov rax,dr7
and eax,not ((1111b shl 16) + 11b)	; mask off all
or eax,(0000b shl 16) + 1
mov dr7,rax

; example 3
; we want to watch reading from or writing into ports 20h-27h	 	(kernel dbg - in, out, insb, outsb)
mov rax,cr4
or rax,1 shl 3	; CR4.DE bit 3. on (Debugging Extensions)
mov cr4,rax
mov eax,20h
mov dr3,rax
mov rax,dr7
and eax,not ((1111b shl 28) + 11000000b) ; mask off all
or eax,(1010b shl 28) + 01000000b	 ; LEN3=10b (8 bytes), R/W3=10b (I/O)
mov dr7,rax

The condition which caused an int01 exception is recorded in the DR6 debug-status register.
DR6
bit	name	event
14	BS	Single Step (rFLAGS.TF has been set)
13	BD	Breakpoint Debug Access Detected				(DR7.GD was set)
3	B3	Breakpoint #3 Condition Detected
2	B2	Breakpoint #2 Condition Detected
1	B1	Breakpoint #1 Condition Detected
0	B0	Breakpoint #0 Condition Detected

DR7
bit(s)	mnemonic	description
13	GD	General Detect Enabled
When this bit is set, the debug exception (int01) occurs when an attempt is made to execute a MOV DRn instruction to any debug register (DR0-DR3, DR6, DR7). This bit is cleared to 0 by the processor when the int01 handler is entered, allowing the int01 handler to read and write the DR registers. The int01 exception occurs before executing the instruction, and DR6.BD is set by the processor. Software debuggers can use this bit to prevent the currently-executing program from interfering with the debug operation.

int01_handler:
push rax
mov rax,dr6
bt eax,14
jc single_step_detected
bt eax,13
jc debug_access_detected
test eax,1 shl 3
jnz bp3_detected
test eax,1 shl 2
jnz bp2_detected
test eax,1 shl 1
jnz bp1_detected
test eax,1
jnz bp0_detected
icebp_detected:
...
pop rax
iretq

Instruction execution breakpoint and general-detect condition cause the int01 exception to occur BEFORE the instruction is executed.
All other breakpoint conditions (Data Write Only, Data Read or Data Write, I/O Read or I/O Write) and single stepping cause the int01 exception to occur AFTER the instruction is executed. Repeated operations (REP prefix, e.g. repz movsb) can hit a breakpoint between iterations.
Data-breakpoint conditions on the previous instruction occur before an instruction-breakpoint condition on the next instruction. However, if instruction and data breakpoints can occur as a result of executing a single instruction, the instruction breakpoint occurs first (before the instruction is executed), followed by the data breakpoint (after the instruction is executed).


Single Step
triggers int01 vector

Single-step breakpoints are enabled by setting the rFLAGS.TF bit to 1. When single stepping is enabled, an int01 exception occurs after every instruction is executed until it is disabled by clearing rFLAGS.TF to 0. The instruction that sets the TF bit is not itself single stepped: the first int01 occurs after the instruction that follows it has been executed.

pushf
or dword [rsp],1 shl 8
popf
; rflags.TF=1 now
mov edx,eax
; now int01 occurs for the first time (after the execution of mov; the mov instruction is allowed to complete because single step is a TRAP type of exception, not a FAULT type)
pushf
; now int01 occurs again
and dword [rsp],not (1 shl 8)
; int01 occurs for the third time
popf
; int01 occurs for the fourth and last time, after the execution of the popf instruction
; rFLAGS.TF=0 now
mov ebx,ecx	; this doesn't trigger int01 anymore

int01_handler:
; When an int01 exception occurs due to single stepping, the processor clears rFLAGS.TF to 0 before entering the int01 handler, so that the handler itself is not single stepped.
; The processor also sets DR6.BS to 1, which indicates that the int01 exception occurred as a result of single stepping.
push rax
mov rax,dr6
bt eax,14	; DR6. BS
jnc else_than_single_step
single_step_detected:
; The rFLAGS image pushed onto the debug-handler stack has the TF bit set, and single stepping resumes when a subsequent IRETQ pops the stack image into the rFLAGS register.
pop rax		; restore rax pushed at the handler entry
iretq
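
To stop single stepping, the handler can clear the TF bit in that saved rFLAGS image before returning. A minimal sketch, assuming the handler has pushed only rax (so the rFLAGS image lies at [rsp+8+16]); stop_stepping and not_single_step are hypothetical labels:
stop_stepping:
push rax
mov rax,dr6
bt eax,14				; DR6.BS
jnc not_single_step			; other int01 sources are handled elsewhere
; stack now: [rsp]=saved rax, [rsp+8]=RIP, [rsp+16]=CS, [rsp+24]=rFLAGS
and qword [rsp+8+16],not (1 shl 8)	; TF=0 in the image, no more int01 after iretq
pop rax
iretq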

Single stepping can be a bit more complicated, we will discuss it later.


"Software" Breakpoint
triggers int03 vector

db 0CCh = int3 ; very useful, 1 byte instruction fits to overwrite the first byte of any other instruction
db 0CDh, 03h ; useless, can't fit into 1-byte instructions (cld; push/pop gpr64; xchg gpr32,eax; stosb; ...)

- compiled into the program at the development stage - a trick for getting easily and quickly into the desired part of the program under development
- put in program by debugger

steps when handling SW BP:
1. debugger reads the original byte and saves it together with its address by storing them into an internal buffer
2. debugger replaces the original byte with the byte 0CCh
3. debugger waits until int03 occurs
4. int03 handler gets the address just after the executed byte 0CCh
5. debugger calculates internal value X by subtracting 1 from address returned in step 4 (X=RIP - 1)
6. debugger checks its internal buffer if any stored addr matches X
7. if no such address is found, it is an int3 instruction compiled into the program (the source of the program has an int3 instruction, the developer must remove it eventually)
	jmp end_of_int03_handler
8. if such an address is found, it was a breakpoint caused by the byte 0CCh inserted into the program by the debugger
	restore the original byte at address X
	decrease RIP of the program (RIP - 1 = X)
end_of_int03_handler:
iretq
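
For a user-mode debugger on Linux64, the steps above might look roughly like this. It is a sketch with a single breakpoint only: child_pid, bp_addr, bp_orig and regs are hypothetical names and the numbers are the x86-64 Linux values. Note that the raw ptrace syscall stores the result of PTRACE_PEEKTEXT at the address passed in r10.
sys_ptrace	=	101
PTRACE_PEEKTEXT	=	1
PTRACE_POKETEXT	=	4
PTRACE_GETREGS	=	12
PTRACE_SETREGS	=	13
REGS_RIP	=	128		; offset of rip in user_regs_struct

; steps 1-2: save the original qword, patch its lowest byte to 0CCh
set_sw_bp:
mov edi,PTRACE_PEEKTEXT
mov esi,[child_pid]
mov rdx,[bp_addr]
lea r10,[bp_orig]		; the qword read is stored here
mov eax,sys_ptrace
syscall
mov r10,[bp_orig]
mov r10b,0CCh
mov edi,PTRACE_POKETEXT
mov esi,[child_pid]
mov rdx,[bp_addr]
mov eax,sys_ptrace
syscall
ret

; steps 4-8: called after the debugger was notified about int03
handle_sw_bp:
mov edi,PTRACE_GETREGS
mov esi,[child_pid]
xor edx,edx
lea r10,[regs]
mov eax,sys_ptrace
syscall
mov rax,[regs+REGS_RIP]
dec rax				; X = RIP - 1
cmp rax,[bp_addr]
jne .done			; step 7: int3 compiled into the program
mov [regs+REGS_RIP],rax		; step 8: RIP = X
mov edi,PTRACE_POKETEXT		; restore the original byte
mov esi,[child_pid]
mov rdx,[bp_addr]
mov r10,[bp_orig]
mov eax,sys_ptrace
syscall
mov edi,PTRACE_SETREGS
mov esi,[child_pid]
xor edx,edx
lea r10,[regs]
mov eax,sys_ptrace
syscall
.done:
ret

; data, to be placed in a writeable segment
child_pid	dd ?
bp_addr		dq ?
bp_orig		dq ?
regs		rq 27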


Other Features (not well known, that's a pity)

We can watch addresses of instructions causing control transfers.

The instructions are:
JMP, CALL, RET, Jcc, JrCXZ, LOOPcc, JMPF, CALLF, RETF, INTn, INT 3, ICEBP, Exceptions, IRET, IRETQ, SYSCALL, SYSRET, NMI, SMI, RSM

We just need to enable 1 bit in 1 register...

The register has the name Debug-Control MSR
DebugCtlMSR	=	01D9h
mov ecx,DebugCtlMSR
rdmsr
or eax,1
wrmsr

DebugCtlMSR
bit	mnemonic	description
1	BTF		Branch Single Step
0	LBR		Last-Branch Record

Setting LBR bit orders the processor to record the source and target addresses of the last control transfer (branch instruction, interrupt, and exception) taken before a debug exception occurs (int01).
The processor automatically disables control-transfer recording when int01 occurs by clearing DebugCtlMSR.LBR to 0. The contents of the control-transfer recording MSRs are not altered by the processor when int01 occurs. Before exiting the debug-exception handler, software can set DebugCtlMSR.LBR to 1 to re-enable the recording mechanism.

After the LBR bit of DebugCtlMSR is enabled, the processor saves the source and destination addresses of the control-transfer events (branches such as call and jmp, interrupts, exceptions) that happen before control is given to int01. They are saved in four registers:
LastBranchFromIP,
LastBranchToIP,
LastExceptionFromIP,
LastExceptionToIP.
These 64-bit read-only registers record control branches.

LastBranchFromIP	=	01DBh
LastBranchToIP		=	01DCh
LastExceptionFromIP	=	01DDh
LastExceptionToIP	=	01DEh
mov ecx,LastBranchFromIP
rdmsr
mov dword [x+4],edx
mov dword [x],eax	; qword [x] holds the 64-bit address
x dq ?
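
Because the four MSR numbers are consecutive, all of them can be read in one loop. A sketch, assuming CPL=0; lbr_dump and read_next_msr are hypothetical names:
LastBranchFromIP	=	01DBh
LastExceptionToIP	=	01DEh
mov ecx,LastBranchFromIP
lea rdi,[lbr_dump]
read_next_msr:
rdmsr				; edx:eax = MSR[ecx]
shl rdx,32
or rax,rdx			; rax = the whole 64-bit address
mov [rdi],rax
add rdi,8
inc ecx
cmp ecx,LastExceptionToIP
jbe read_next_msr
lbr_dump rq 4			; FromIP, ToIP, ExceptionFromIP, ExceptionToIP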

DebugCtlMSR.BTF changes the behavior of the rFLAGS.TF bit. When BTF is cleared to 0 (the normal, most common setting), rFLAGS.TF controls single stepping of every instruction. When BTF is set to 1, rFLAGS.TF controls single stepping on control transfers only (branch instruction, interrupt, exception), so int01 doesn't occur after every instruction but only after control transfers ("bigger single steps").
Debuggers can use this capability to perform a "coarse" single step across blocks of code (bound by control transfers), and then, as the problem search is narrowed, switch into a "fine" single-step mode on every instruction (DebugCtlMSR.BTF=0, rFLAGS.TF=1).
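
Switching to the "coarse" mode from inside the int01 handler might look like this. It is a sketch only, assuming CPL=0 and that nothing has been pushed on the handler stack yet, so the saved rFLAGS image lies at [rsp+16]:
DebugCtlMSR	=	01D9h
push rax
push rcx
push rdx
mov ecx,DebugCtlMSR
rdmsr
or eax,1 shl 1			; BTF=1, rFLAGS.TF now steps by control transfers
wrmsr
pop rdx
pop rcx
pop rax
; stack now: [rsp]=RIP, [rsp+8]=CS, [rsp+16]=rFLAGS
or qword [rsp+16],1 shl 8	; TF=1 in the saved rFLAGS image
iretq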


Summary:


symbols:
address (in binary) -> label (in source)

symbols supported in fdbg:
Linux64:
	ELF64 - DWARF (Debugging With Attributed Record Formats)

Win64:
	exports (as in DLLs) - very useful and easy in FASM
	DBGHELP.DLL - not very useful in FASM, useful in C


breakpoints:
must lie at the beginning of the instruction (not inside it !)

"software" breakpoint
int3, db 0CCh
disadvantage - modifies memory of program

"hardware" breakpoint
debug registers - doesn't modify memory of program
advantage - watching reading/writing memory (I/O ports)
disadvantage - only 4 breakpoints


steps:
trace into
step over

usefulness of step over:
rep (repnz scasb, repz movsb, repz stosb)
call
loop


There are 2 groups of assembly programmers who really need debugging.

1. Programmers doing a biiiiig boooooring exxxxxhaustive job (the reason for making a lot of mistakes).

2. Programmers making a lot of mistakes because they have just started to learn assembly. They improve rapidly thanks to debugging.


This is the specimen (sample) of the first type of programmers
DSC00056.JPG


This is the second type of programmers...
DSC00040.JPG