QEMU and U: Whole-system tracing with QEMU customization
Introduction
QEMU is a key tool for anyone searching for bugs in diverse places. Besides opening the doors to expensive or opaque platforms, QEMU has several internal tools that give developers further insight and control. Researchers comfortable modifying QEMU have access to powerful inspection capabilities. We will walk through a recent custom addition to QEMU to highlight some helpful internal tools and demonstrate the power of a hackable emulator.
The target was a SoC that had an interesting system spread across multiple processes and libraries. We could communicate with this system from the external network, and we wanted to know the extent of our reach before authentication. Because of the design of the system, it was not simple to track down all the places our influence reached without valid credentials. A better map of that surface area would be helpful for further findings. We had done the prior work to get the target up and running in QEMU, so why not just have the emulator tell us?
Tracing in QEMU
Tracing guest execution in QEMU is not as simple as calling printf("%p\n", pc); for every instruction. The thing that puts the Q in QEMU is the TCG. The TCG (Tiny Code Generator) is a just-in-time compiler that translates blocks of guest instructions into code that can run on the host. While it would be simple to trace each new block as it is translated, once translated the blocks can run many times, unimpeded and untracked by QEMU code. If all that is needed is a trace of when each block is translated, QEMU's built-in tracing can provide that information. (The event is translate_block. See the docs for more details.)
Once a block is translated, the emitted code may be used and reused many times. For our target we wanted to be able to start our trace when the system was in a steady state, when many blocks would have already been translated. If we want to trace every time some basic block is executed in the guest, we need to emit our own operations in front of the rest of the translated block.
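To make the difference between translation-time and execution-time events concrete, here is a toy model of a translation cache. It is only a sketch: every name in it (toy_tb, toy_execute, and so on) is hypothetical and nothing here is QEMU code. It shows that a hook tied to translation fires once per cached block, while a hook placed at the start of the block fires on every run.

```c
#include <stdint.h>
#include <string.h>

/* Toy model only: none of these names exist in QEMU. */
#define TOY_CACHE_SLOTS 64

struct toy_tb {
    uint32_t pc;     /* guest address the block starts at */
    int      valid;  /* nonzero once the block has been "translated" */
};

static struct toy_tb toy_cache[TOY_CACHE_SLOTS];
static unsigned toy_translate_events; /* what a translation-time trace sees */
static unsigned toy_execute_events;   /* what a per-execution hook sees */

/* Forget all cached blocks and zero both counters. */
static void toy_reset(void)
{
    memset(toy_cache, 0, sizeof(toy_cache));
    toy_translate_events = 0;
    toy_execute_events = 0;
}

/* Look up a block; "translate" only on a miss, as a JIT cache would. */
static struct toy_tb *toy_lookup_or_translate(uint32_t pc)
{
    struct toy_tb *tb = &toy_cache[(pc >> 2) % TOY_CACHE_SLOTS];

    if (!tb->valid || tb->pc != pc) {
        tb->pc = pc;
        tb->valid = 1;
        toy_translate_events++; /* fires once per translation */
    }
    return tb;
}

/* "Run" a block: a hook emitted at the block's start fires every time. */
static void toy_execute(uint32_t pc)
{
    (void)toy_lookup_or_translate(pc);
    toy_execute_events++;
}
```

Running the same block a thousand times in this model produces one translation event and a thousand execution events, which is why a trace started in a steady state needs the second kind of hook.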
There are lots of great references we can turn to for emitting custom operations alongside the translated code. QEMU itself can place instructions before each basic block that are used to count the number of instructions executed. We can follow the call to gen_tb_start in translator.c to its definition. There, operations to check the instruction count are added so execution can be halted if a limit is reached.
/*...*/
tcg_gen_ld_i32(count, cpu_env,
offsetof(ArchCPU, neg.icount_decr.u32) -
offsetof(ArchCPU, env));
if (tb_cflags(tb) & CF_USE_ICOUNT) {
/*
* We emit a sub with a dummy immediate argument. Keep the insn index
* of the sub so that we later (when we know the actual insn count)
* can update the argument with the actual insn count.
*/
tcg_gen_sub_i32(count, count, tcg_constant_i32(0));
icount_start_insn = tcg_last_op();
}
tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);
The piece of gen_tb_start emitting the conditional branch
Thankfully, we do not have to specify individual operations like the icount code does. To simplify things, QEMU can generate “helper” functions, which emit operations that call out to a native function from within the translated blocks. This is what AFL++’s fork of QEMU uses for its tracing instrumentation, without having to modify the guest binary. AFL++’s QEMU has to track unique paths taken, and the code makes for a good example for our use case. In the AFL++ fork, the function afl_gen_trace is called immediately before a basic block is translated.
tcg_ctx->cpu = env_cpu(env);
afl_gen_trace(pc);
gen_intermediate_code(cpu, tb, max_insns);
In tb_gen_code, where AFL++ emits operations to trace execution
There they call gen_helper_afl_maybe_log, but searching the source turns up no definition for that function. This is a helper function: the build system creates a definition that emits operations to perform a call to the function HELPER(afl_maybe_log).
void HELPER(afl_maybe_log)(target_ulong cur_loc) {
register uintptr_t afl_idx = cur_loc ^ afl_prev_loc;
INC_AFL_AREA(afl_idx);
afl_prev_loc = cur_loc >> 1;
}
AFL++'s trace helper, adjusting a map in shared memory
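The edge-hashing idea in that helper can be tried outside of QEMU. The following is a minimal standalone sketch, not AFL++'s code: the map size, the masking step, and the names edge_map, edge_log, and edge_reset are assumptions made for illustration. It keeps the two details that matter, XORing the current location with the previous one and shifting the previous location so that the edges A to B and B to A land in different slots.

```c
#include <stdint.h>
#include <string.h>

/* Assumed map size for this sketch; must be a power of two. */
#define EDGE_MAP_SIZE (1u << 16)

static uint8_t  edge_map[EDGE_MAP_SIZE];
static uint32_t edge_prev_loc;

/* Clear the map and forget the previous location. */
static void edge_reset(void)
{
    memset(edge_map, 0, sizeof(edge_map));
    edge_prev_loc = 0;
}

/* Record the edge from the previously logged block to cur_loc. */
static void edge_log(uint32_t cur_loc)
{
    /* Masking here is this sketch's way of keeping the index in range. */
    uint32_t idx = (cur_loc ^ edge_prev_loc) & (EDGE_MAP_SIZE - 1);

    edge_map[idx]++;
    /* Shift so that A->B and B->A hash to different slots. */
    edge_prev_loc = cur_loc >> 1;
}
```

For example, logging 0x10 and then 0x20 increments slot 0x28, while logging them in the opposite order increments slot 0 instead.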
The function is declared as DEF_HELPER_FLAGS_1(afl_maybe_log, TCG_CALL_NO_RWG, void, tl). QEMU’s build system handles generating the code that creates the TCG operations to call the helper function. The “_1” indicates it takes one argument, and the last two arguments to the macro are the return type and the argument type. tl indicates target_ulong. Another helpful argument type is env, which passes a CPUArchState * argument to the helper function. ptr, i64, f32, and such all do what they say on the tin.
For our tool, we used a helper function to call out at the beginning of every code block. In target/arm/translate.c we added gen_helper_bb_enter(cpu_env, tcg_constant_i32(4)) in the function arm_tr_tb_start, which is called at the beginning of translating every block for an ARM guest. This generates code for each basic block that calls our function HELPER(bb_enter).
This leads us to another problem we encounter when trying to trace such a complex system. On our target, if we implement HELPER(bb_enter) with fprintf(logfile, "@%p\n", env->regs[15]) we are quickly going to slow our emulator to a crawl and be left with unreasonably large files. In our case, we did not care much about the order in which these basic blocks were hit; we only cared which basic blocks were hit uniquely when we interacted with the system over the network. For this we implemented a form of Differential Debugging.
We had to communicate to QEMU when to start and stop a trace, so we could take separate recordings: one of the area covered while running the system without interacting with it over the network, and a second of the area covered during lots of varied non-authenticated interaction with the system over the network. We then found the area covered in the second recording that was not covered in the first, and had our tool report this as surface area to be further tested and reviewed for vulnerabilities.
To do this we implemented the tracing as a bitmap of the address space we cared about. We adjusted the granularity of our map so that every entry accounted for 0x10 bytes of code, which for our 32-bit ARM target produced perfectly manageable file sizes.
// paddr to start watching
#define MAP_START_PADDR 0x80000000
// size of memory region
#define MAP_SIZE 0x20000000
#define MAP_END_PADDR (MAP_START_PADDR + MAP_SIZE)
#define MAP_GRAN_SHF 4 // 0x10 granularity
#define INDX_OFF (MAP_START_PADDR >> MAP_GRAN_SHF)
#define BB_MAP_INDEX(addr) (((addr) - INDX_OFF) >> 3)
#define BB_MAP_BIT(addr) ((addr) & ((1<<3)-1))
unsigned char bb_map[(MAP_SIZE >> (MAP_GRAN_SHF + 3))];
void HELPER(bb_enter)(CPUARMState *env, int blksz)
{
/* ... */
pend = pstart + blksz - 1;
if ((pstart < MAP_START_PADDR) || (pstart >= MAP_END_PADDR)) {
// not in region
return;
}
if (pend >= MAP_END_PADDR) {
pend = MAP_END_PADDR-1;
}
pstart >>= MAP_GRAN_SHF;
pend >>= MAP_GRAN_SHF;
while (pstart <= pend) {
bb_map[BB_MAP_INDEX(pstart)] |= (1 << BB_MAP_BIT(pstart));
pstart++;
}
return;
}
Piece of relevant code for implementing our tracing bitmap
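The index arithmetic is easy to get subtly wrong, so it can help to check it in isolation. Below is a standalone sketch of the same marking logic with the physical start address and block size passed in as plain arguments. The names (cov_map, cov_mark, cov_test) and the small 64 KiB region are assumptions chosen to keep the example light; they are not the values or names from our tool.

```c
#include <stdint.h>

#define COV_START 0x80000000u   /* assumed start of the watched region */
#define COV_SIZE  0x00010000u   /* assumed region size (64 KiB) */
#define COV_END   ((uint64_t)COV_START + COV_SIZE)
#define COV_SHIFT 4             /* 0x10 bytes of code per bit */

/* One bit per granule, eight granules per byte. */
static uint8_t cov_map[COV_SIZE >> (COV_SHIFT + 3)];

/* Mark every granule touched by [pstart, pstart + blksz). */
static void cov_mark(uint64_t pstart, uint32_t blksz)
{
    uint64_t pend, first, last, i;

    if (blksz == 0 || pstart < COV_START || pstart >= COV_END) {
        return; /* not in the watched region */
    }
    pend = pstart + blksz - 1;
    if (pend >= COV_END) {
        pend = COV_END - 1; /* clip blocks that run past the region */
    }
    first = (pstart - COV_START) >> COV_SHIFT;
    last  = (pend - COV_START) >> COV_SHIFT;
    for (i = first; i <= last; i++) {
        cov_map[i >> 3] |= (uint8_t)(1u << (i & 7));
    }
}

/* Return 1 if the granule containing paddr has been marked, else 0. */
static int cov_test(uint64_t paddr)
{
    uint64_t i;

    if (paddr < COV_START || paddr >= COV_END) {
        return 0;
    }
    i = (paddr - COV_START) >> COV_SHIFT;
    return (cov_map[i >> 3] >> (i & 7)) & 1;
}
```

A call such as cov_mark(0x8000001c, 8) covers bytes 0x1c through 0x23 and therefore sets two bits, one for the granule at 0x10 and one for the granule at 0x20.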
We also had to add some method to tell our emulator when to start, stop, clear, or write out a coverage map. QEMU provides a nice way to implement commands such as these in its HMP (human monitor) system. The documentation contains instructions on how to add monitor commands. The basic process involves adding an entry in the hmp-commands.hx file describing the command name, the arguments expected, and a bit of info about the command. The handler declarations can go in include/monitor/hmp.h, and the definitions typically go in monitor/hmp-cmds.c.
We implemented a clear, start, stop, and write command for our tracing.
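The state those four commands manipulate is small. The sketch below models it as standalone C, independent of QEMU's monitor API: the function names, the map size, and the raw-dump file format are all assumptions for illustration. In a real QEMU build each of these bodies would sit inside an HMP handler rather than a free function.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TRACE_MAP_BYTES 512 /* assumed size for this sketch */

static uint8_t trace_map[TRACE_MAP_BYTES];
static int     trace_enabled;

/* start / stop: toggle whether the per-block helper records anything. */
static void trace_cmd_start(void) { trace_enabled = 1; }
static void trace_cmd_stop(void)  { trace_enabled = 0; }

/* clear: reset the map between recordings. */
static void trace_cmd_clear(void)
{
    memset(trace_map, 0, sizeof(trace_map));
}

/* write: dump the raw map to a file; returns 0 on success, -1 on error. */
static int trace_cmd_write(const char *path)
{
    FILE *f = fopen(path, "wb");
    size_t n;

    if (f == NULL) {
        return -1;
    }
    n = fwrite(trace_map, 1, sizeof(trace_map), f);
    if (fclose(f) != 0 || n != sizeof(trace_map)) {
        return -1;
    }
    return 0;
}

/* What the per-block helper would do: record only while tracing is on. */
static void trace_hit(unsigned bit)
{
    if (!trace_enabled || bit >= TRACE_MAP_BYTES * 8) {
        return;
    }
    trace_map[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}
```

Checking the enabled flag first keeps the helper cheap while no recording is in progress, and a separate clear command lets the baseline and the interaction recordings start from the same empty map.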
For many targets this would be enough, and we could move on to writing tooling to convert our coverage information to file offsets. The system we wanted to gather info on was running in usermode code across multiple processes on our target. If we had logged based on the instruction pointer, we would have a trace of virtual addresses across all processes. Most of these virtual addresses are not going to be unique across processes, rendering our system coverage mostly meaningless.
We got around this issue with a bit of a hack. By translating the virtual address of the instruction pointer to a physical address, we can avoid aliasing between processes. This works for the system we were testing because the relevant processes all remained running the whole time. For an extra measure we turned swap off, keeping our pages from moving around underneath us.
This is probably not a Good Idea™ for most tracing use cases, but it worked well for our setup and we were able to implement it quickly. We made use of a function in QEMU called get_phys_addr that exists for ARM targets. We probably would have been better off using something that made use of the TLB, as the constant translation slowed down the emulator noticeably when our tracing was enabled.
/*...*/
target_ulong start;
hwaddr pstart;
hwaddr pend;
MemTxAttrs attrs = { 0 };
int prot = 0;
target_ulong page_size = { 0 };
ARMMMUFaultInfo fi = { 0 };
ARMCacheAttrs cacheattrs = { 0 };
start = env->regs[15];
// convert to physical address
// >:|
// returns bool, but 0 means success
if (get_phys_addr(
env,
start,
MMU_INST_FETCH,
arm_mmu_idx(env),
&pstart,
&attrs,
&prot,
&page_size,
&fi,
&cacheattrs
)) {
printf("DBG Could not get phys addr for %x\n", start);
return;
}
Our call to get_phys_addr to translate the instruction pointer into a physical address
To work with physical addresses, we confined our coverage map to the part of the physical address space that we knew corresponded to RAM. Before and after obtaining our two coverage recordings we took physical memory dumps of our system using the existing QEMU monitor command pmemsave. To parse the coverage data for unique coverage, we made a small script that evaluated the dumps, the coverage maps, and any relevant binary files. Upon finding bits in the bitmap that are unique to the second recording, the script checks whether the memory dumps show them to be in one of the relevant binary files. We cannot do exact matching on the binary files because relocations will have changed the contents, so we simply align the text section and check if it is “near enough” a match. With a good threshold for “near enough” we obtained accurate results. From there we translated the unique bit locations to file offsets and generated coverage data that could be used with IDA, Binary Ninja, or Ghidra.
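The first step of that script, finding bits unique to the second recording and turning them back into physical addresses, can be sketched in a few lines. This is an illustration rather than our script: the function name cov_unique_addrs and its parameters are hypothetical, and it assumes the same layout as the bitmap described earlier, with bit (i & 7) of byte (i >> 3) standing for granule i.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the physical address of every granule that is set in `active`
 * but not in `baseline`. Up to out_cap addresses are stored in `out`;
 * the return value is the total number of unique granules found.
 */
static size_t cov_unique_addrs(const uint8_t *baseline, const uint8_t *active,
                               size_t nbytes, uint64_t region_start,
                               unsigned gran_shift,
                               uint64_t *out, size_t out_cap)
{
    size_t count = 0;
    size_t i;
    unsigned bit;

    for (i = 0; i < nbytes; i++) {
        /* Bits hit only in the second recording. */
        unsigned fresh = active[i] & ~(unsigned)baseline[i] & 0xffu;

        for (bit = 0; bit < 8; bit++) {
            if (fresh & (1u << bit)) {
                uint64_t granule = (uint64_t)i * 8 + bit;

                if (count < out_cap) {
                    out[count] = region_start + (granule << gran_shift);
                }
                count++;
            }
        }
    }
    return count;
}
```

Each resulting address would then be looked up in the memory dumps to decide which binary, if any, it belongs to before being converted to a file offset.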
(Lighthouse is our favorite coverage plugin for IDA and Binary Ninja. Dragon Dance is a good alternative for Ghidra. Lighthouse’s modoff format is very simple to implement. If coverage compatible with Dragon Dance is needed, the drcov format is simple enough to implement, and Qiling framework has some good example code for generating it.)
Conclusion
This solution worked well for our target and gave us some areas to dig into that would have otherwise been difficult to find quickly. The purpose of this post is not to introduce some new fork of QEMU with this tool built in. There are already too many unmaintained forks of QEMU for vulnerability research, and this tool would be a lot less effective in other situations. This is meant as more of a love note to QEMU, and we hope it inspires other researchers to make better use of their favorite emulator. The internals of QEMU let us develop our tracing tool quickly, so we could return our focus to finding vulnerabilities.