Backdoor support for Control-Transfer Breakpoint Features (branches)
in the x64 version of Microsoft Windows (AMD64, Intel EM64T)

The story of a coincidence.

article written by Feryno on 2007 October 29
article published on 2007 October 30 at
    http://x86asm.net
    and
    http://fdbg.x86asm.net/article_DebugCtlMSR_backdoor_in_win64.txt
some fragments of info posted on 2007 October 29 at internet forums
    http://board.flatassembler.net/topic.php?p=63948#63948
    http://www.openrce.org/blog/view/535/Branch_Tracing_with_Intel_MSR_Registers
working project fdbg001A.zip finished on 2007 October 28

As the manuals state, both AMD64 and Intel EM64T CPUs support Control-Transfer Breakpoint Features. More than 2 years ago I discovered these fields at the end of CONTEXT in ntddk.h:

    typedef struct DECLSPEC_ALIGN(16) _CONTEXT {
        ...
        ULONG64 DebugControl;
        ULONG64 LastBranchToRip;
        ULONG64 LastBranchFromRip;
        ULONG64 LastExceptionToRip;
        ULONG64 LastExceptionFromRip;
    } CONTEXT, *PCONTEXT;

My first thought was that Microsoft used these variables for some purpose. But I didn't find any way to make the OS fill them with useful data. They were always all zeros. So my second thought was that they would be used in the future and for now are only reserved in the CONTEXT structure. I didn't know which of these 2 predictions was true and which was false. If the first one was true, it would be only a question of time until somebody found a way to use them. If the second one was true, it would again be only a question of time until Microsoft implemented them.

So my first approach, which succeeded, was to write a driver to read/write MSR registers. I also realized that these 4 MSRs (LastBranchFromRip, LastBranchToRip, LastExceptionFromRip, LastExceptionToRip) are read-only and thus can't be written back from CONTEXT into the CPU when the OS switches tasks. I didn't know whether the OS saves them at all.
I supposed that the best moment to save them is just when the thread being debugged generates an exception. But success was delayed a bit until I realized that the OS often generates int0E (page-fault exception) when manipulating pages (loading them from the swap device). It was also clear that the best moment to save the 4 MSRs is on entry to the exception handler (as early as possible), because the registers change at any branching instruction (so I had to avoid e.g. a call instruction before saving them).

In the end I had a working driver which hooked exceptions (interrupts 00-1F), and on every exception it saved the 4 MSRs into an internal buffer in the driver.

The first problem was that sometimes a page fault occurred, as the OS loaded pages from swap, between saving the MSRs and reading the saved values from the driver.

The second problem was to find the thread which caused the exception, i.e. to find the owner of the saved registers. For a thread being debugged it didn't matter, as it often generated exceptions (e.g. breakpoint, single step). But loading a page from swap occasionally (rarely) overwrote the saved values with new ones between the moment of the exception in the program being debugged and the moment when the debugger read them from the driver's buffer.

The third problem was Patchguard, which checked the kernel integrity randomly every 5-10 minutes and often rebooted my testing PC with:

    BugCheck 109, {a3a03a387918c925, b3b746becb988153, fffff8000010b070, 2}

The Bugcheck Analysis looked like:

    CRITICAL_STRUCTURE_CORRUPTION (109)
    This bugcheck is generated when the kernel detects that critical kernel code or
    data have been corrupted. There are generally three causes for a corruption:
    1) A driver has inadvertently or deliberately modified critical kernel code
       or data. See http://www.microsoft.com/whdc/driver/kernel/64bitPatching.mspx
    2) A developer attempted to set a normal kernel breakpoint using a kernel
       debugger that was not attached when the system was booted.
       Normal breakpoints, "bp", can only be set if the debugger is attached at
       boot time. Hardware breakpoints, "ba", can be set at any time.
    3) A hardware corruption occurred, e.g. failing RAM holding kernel code or data.
    Arguments:
    Arg1: a3a03a387918c925, Reserved
    Arg2: b3b746becb988153, Reserved
    Arg3: fffff8000010b070, Failure type dependent information
    Arg4: 0000000000000002, Type of corrupted region, can be
        0 : A generic data region
        1 : Modification of a function or .pdata
        2 : A processor IDT
        3 : A processor GDT
        4 : Type 1 process list corruption
        5 : Type 2 process list corruption
        6 : Debug routine modification
        7 : Critical MSR modification

At least one of the reported MSRs was always an address in kernel-mode space (the start address of the corresponding exception handler), so I started to play with local kernel debugging (only local kernel debugging, as I don't have two PCs close enough to connect them for remote debugging). Fortunately, I discovered this fragment of kernel code:

    fffff80001041628  mov  rax,dr6
    fffff8000104162b  mov  rdx,dr7
    fffff8000104162e  mov  [rcx+0x40],rax          ; save DR6
    fffff80001041632  mov  [rcx+0x48],rdx          ; save DR7
    fffff80001041636  xor  eax,eax
    fffff80001041638  mov  dr7,rax                 ; zero DR7
    fffff8000104163b  cmp  byte ptr gs:[000007bd],0x1
    fffff80001041644  jnz  fffff800010416b0
    fffff80001041646  test dx,0x300                ; test DR7.LE, DR7.GE
    fffff8000104164b  jz   fffff800010416b0        ; skip saving the branch registers if neither DR7 bit is set
    fffff8000104164d  mov  r8,rcx                  ; keep the data pointer in r8, because ecx is needed for the MSR number
    fffff80001041650  mov  ecx,0x1db               ; LastBranchFromIP
    fffff80001041655  rdmsr
    fffff80001041657  mov  [r8+0x88],eax
    fffff8000104165e  mov  [r8+0x8c],edx
    fffff80001041665  mov  ecx,0x1dc               ; LastBranchToIP
    fffff8000104166a  rdmsr
    fffff8000104166c  mov  [r8+0x80],eax
    fffff80001041673  mov  [r8+0x84],edx
    fffff8000104167a  mov  ecx,0x1dd               ; LastExceptionFromIP
    fffff8000104167f  rdmsr
    fffff80001041681  mov  [r8+0x98],eax
    fffff80001041688  mov  [r8+0x9c],edx
    fffff8000104168f  mov  ecx,0x1de               ; LastExceptionToIP
    fffff80001041694  rdmsr
    fffff80001041696  mov  [r8+0x90],eax
    fffff8000104169d  mov  [r8+0x94],edx
    fffff800010416a4  mov  ecx,0x1d9               ; DebugCtlMSR
    fffff800010416a9  rdmsr
    fffff800010416ab  and  eax,0xfffffffc          ; disable DebugCtlMSR.LBR, DebugCtlMSR.BTF
    fffff800010416ae  wrmsr
    fffff800010416b0  ret

This code fragment gave me hope that the OS saves the MSRs somewhere, to be transferred later into the thread CONTEXT.

I was also fighting another problem: the debug exception (int01). This exception clears DebugCtlMSR.LBR and DebugCtlMSR.BTF (just as the CPU clears rflags.TF and DR7.GD when switching from the thread generating the debug exception to the debug exception handler). So my driver re-enabled one or both bits (depending on the request) in DebugCtlMSR at the end of the new hooked routine for int01. The disadvantage was that the bits were then enabled for whichever (and thus unknown) thread the CPU executed next. Another problem was multi-CPU systems, where I had to set both bits on all CPUs in the system.

In conclusion, I had a relatively well working proof of concept which wasn't perfect, but it usually worked (correctly more than 99% of the time). To be safe, I had to reboot the OS, hook exceptions and do debugging until 5 minutes expired (the safe interval to avoid a reboot by Patchguard). Very rarely (less than 1%), all 4 branch MSRs were overwritten with useless addresses when the OS loaded a page from swap between saving the registers into the driver's buffer (in the exception handler) and transferring them from the driver into the debugger (reading the saved data from the driver).
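As a reading aid, the save routine listed above can be modeled in portable C. This is only a sketch under stated assumptions: the type and function names are made up for illustration, the MSRs are simulated by an ordinary struct instead of rdmsr/wrmsr, and the byte tested at gs:[000007bd] (whose meaning is not explained in the listing) is reduced to a plain flag argument. It is not the kernel's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DR7_LE (UINT64_C(1) << 8)   /* DR7 bit 8 */
#define DR7_GE (UINT64_C(1) << 9)   /* DR7 bit 9 */

/* Simulated MSR file (hypothetical type): stands in for rdmsr/wrmsr. */
typedef struct {
    uint64_t debugctl;             /* MSR 0x1d9 */
    uint64_t last_branch_from;     /* MSR 0x1db */
    uint64_t last_branch_to;       /* MSR 0x1dc */
    uint64_t last_exception_from;  /* MSR 0x1dd */
    uint64_t last_exception_to;    /* MSR 0x1de */
} sim_msrs;

/* Save area (hypothetical type); comments give the offsets from the listing.
 * The listing stores EAX and EDX separately, which on a little-endian CPU
 * amounts to one 64-bit store per MSR. */
typedef struct {
    uint64_t last_branch_to;       /* [r8+0x80] */
    uint64_t last_branch_from;     /* [r8+0x88] */
    uint64_t last_exception_to;    /* [r8+0x90] */
    uint64_t last_exception_from;  /* [r8+0x98] */
} branch_save_area;

/* Returns 1 if the branch MSRs were saved, 0 if the routine skipped them.
 * dr7 is the value read from DR7 before the routine zeroes the register. */
int save_branch_msrs(uint64_t dr7, int cpu_flag,
                     sim_msrs *msr, branch_save_area *out)
{
    assert(msr != NULL && out != NULL);

    if (cpu_flag != 1)                      /* cmp byte ptr gs:[7bd],1 / jnz */
        return 0;
    if ((dr7 & (DR7_LE | DR7_GE)) == 0)     /* test dx,0x300 / jz */
        return 0;

    out->last_branch_from    = msr->last_branch_from;
    out->last_branch_to      = msr->last_branch_to;
    out->last_exception_from = msr->last_exception_from;
    out->last_exception_to   = msr->last_exception_to;

    /* and eax,0xfffffffc / wrmsr: clear LBR and BTF, keep all other bits */
    msr->debugctl &= ~UINT64_C(3);
    return 1;
}
```

The point of the model is the gate: nothing is saved, and DebugCtlMSR is left alone, unless at least one of DR7 bits 8 and 9 was set for the thread.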
Fortunately and luckily, I discovered this code fragment from the kernel:

    fffff80001041581  mov  rdx,[rcx+0x48]          ; get value to be written into DR7
    fffff80001041585  xor  eax,eax
    fffff80001041587  mov  dr6,rax
    fffff8000104158a  mov  dr7,rdx
    fffff8000104158d  cmp  byte ptr gs:[000007bd],0x1
    fffff80001041596  jnz  fffff800010415c2
    fffff80001041598  test dx,0x200                ; test DR7.GE (bit 9)
    fffff8000104159d  jz   fffff800010415a2
    fffff8000104159f  or   eax,0x2                 ; bit 1 of eax = DR7.GE
    fffff800010415a2  test dx,0x100                ; test DR7.LE (bit 8)
    fffff800010415a7  jz   fffff800010415ac
    fffff800010415a9  or   eax,0x1                 ; bit 0 of eax = DR7.LE
    fffff800010415ac  test eax,eax
    fffff800010415ae  jz   fffff800010415c2
    fffff800010415b0  mov  r8d,eax                 ; save eax into r8d
    fffff800010415b3  mov  ecx,0x1d9               ; DebugCtlMSR
    fffff800010415b8  rdmsr
    fffff800010415ba  and  eax,0xfffffffc          ; mask off DebugCtlMSR.LBR, DebugCtlMSR.BTF (bits 0, 1)
    fffff800010415bd  or   eax,r8d                 ; set DebugCtlMSR.LBR, DebugCtlMSR.BTF according to DR7.LE, DR7.GE
    fffff800010415c0  wrmsr                        ; write the value back to DebugCtlMSR
    fffff800010415c2  ret

What does the code do? It sets the debug registers (surely those of the thread, just before switching to it). Then it checks two bits in DR7 and sets two bits in DebugCtlMSR according to them. Strange... But I very soon realized how clever the programmer who implemented this revolutionary idea was!

The programmer surely had thoughts something like this:

DebugCtlMSR.LBR, DebugCtlMSR.BTF, DR7.GD and rflags.TF are cleared when entering the debug exception handler (int01). Rflags.TF can be easily restored, because the image of the rflags register from just before entering the debug exception handler is pushed onto the stack when an exception is generated.
Restoring DR7.GD isn't a problem either, because its setting before the debug exception is known: it was set if the debug exception was generated as a general detect (an access to a debug register), and on entry to the debug exception handler the DR6.BD bit is set to reflect the DR7.GD setting before the exception.

DR7.LE and DR7.GE (bits 8 and 9, Local/Global Exact Breakpoint Enabled) are both ignored by implementations of the AMD64 architecture, as written in the manual (24593.pdf from www.amd.com, chapter 13.1.1 Debug-Control Register DR7), because all breakpoint conditions, except certain string operations preceded by a repeat prefix, are exact. These bits aren't cleared when entering the debug exception handler.

DebugCtlMSR.LBR and DebugCtlMSR.BTF are destroyed when entering the debug exception handler.
DR7.LE and DR7.GE aren't destroyed.
DR7.LE and DR7.GE aren't used for anything meaningful.
DR7.LE and DR7.GE were used years ago in older CPUs, but they are still present in CPUs and they are free now...

So the revolutionary idea of the programmer was certainly:

    Let's use DR7.LE and DR7.GE as shadows (aliases) for DebugCtlMSR.LBR, DebugCtlMSR.BTF

And so the programmer did.

The other benefit of this 'hack': as debug registers are specific to a thread (every thread has its own debug registers, which are reloaded when switching to the thread), branch recording / stepping on branches can be re-enabled only for specific thread(s), so other threads don't interfere, and it doesn't matter on which CPU the thread executes in multi-CPU systems. My old driver set DebugCtlMSR for all threads in the system on all CPUs, which is not so desirable. Doing it only for the thread(s) being debugged is the best choice.

I really don't know why Microsoft didn't make this information publicly available. The information can't be abused for any malware. The question is whether 32-bit Windows does the same as the x64 version.
I hope the clever trick won't disappear in newer versions of Windows. Currently Windows XP x64, Windows 2003 x64 and Vista x64 support it perfectly.

This hidden backdoor was just waiting to be discovered. So let's enjoy the new knowledge.

update from 2009 October 02:
The backdoor for branching support works for Intel CPUs on the Windows Server 2008 R2 x64 kernel (the same kernel as Windows 7).
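The pairing described in this article (DR7.LE shadows DebugCtlMSR.LBR, DR7.GE shadows DebugCtlMSR.BTF) can be written down as two small C helpers. This is a minimal sketch with assumptions: the function and macro names are hypothetical, the MSR is passed around as a plain integer, and the idea that a debugger requests the bits through the DR7 value in the thread's CONTEXT is an inference from the text above, not something taken from Windows documentation.

```c
#include <assert.h>
#include <stdint.h>

#define DR7_LE        (UINT64_C(1) << 8)  /* DR7 bit 8 */
#define DR7_GE        (UINT64_C(1) << 9)  /* DR7 bit 9 */
#define DEBUGCTL_LBR  (UINT64_C(1) << 0)  /* DebugCtlMSR bit 0 */
#define DEBUGCTL_BTF  (UINT64_C(1) << 1)  /* DebugCtlMSR bit 1 */

/* Debugger side (hypothetical helper): build the DR7 value for a thread,
 * using LE/GE as the request bits and leaving every other DR7 bit alone. */
uint64_t dr7_request_branch_features(uint64_t dr7, int want_lbr, int want_btf)
{
    dr7 &= ~(DR7_LE | DR7_GE);
    if (want_lbr)
        dr7 |= DR7_LE;      /* ask for last-branch recording */
    if (want_btf)
        dr7 |= DR7_GE;      /* ask for single-step on branches */
    return dr7;
}

/* Kernel side, modeled on the restore routine in the second listing: the
 * DebugCtlMSR value that results when DR7 is reloaded for a thread. */
uint64_t debugctl_after_restore(uint64_t debugctl, uint64_t dr7)
{
    uint64_t shadow = 0;

    if (dr7 & DR7_GE)
        shadow |= DEBUGCTL_BTF;             /* or eax,0x2 */
    if (dr7 & DR7_LE)
        shadow |= DEBUGCTL_LBR;             /* or eax,0x1 */
    if (shadow == 0)
        return debugctl;                    /* jz: the MSR is not touched */

    /* and eax,0xfffffffc / or eax,r8d / wrmsr */
    return (debugctl & ~(DEBUGCTL_LBR | DEBUGCTL_BTF)) | shadow;
}
```

For example, a debugger that wants to step on branches in one thread would compute dr7_request_branch_features(dr7, 0, 1) for that thread. BTF only changes how single-stepping behaves, so rflags.TF has to be set as well.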