LLVM-powered deobfuscation of virtualized binaries

This work was carried out during an internship at Thalium on the subject of deobfuscation of virtualized binaries.

Context

Obfuscation is the process of deliberately making code difficult to understand in order to hinder its analysis. It is often used in malware to conceal malicious intent and avoid detection.

Various binary obfuscation strategies exist today, including:

Removing comments / symbols
Adding opaque predicates (branches on a constant condition)
Control flow flattening
Virtualization

Virtualization is one of the most popular, and potent¹ forms of obfuscation today.

Various obfuscators including Tigress, Themida and VMProtect offer virtualization. Due to its potency and the high availability of obfuscators, virtualization has unfortunately been used by threat actors and found in numerous malware (source: MITRE).

Here is a rough outline of the architecture of a virtualized obfuscated binary:

In a virtualized binary, the original program gets encoded into a sequence of virtual instructions. The obfuscated program is going to contain these virtual instructions as well as an interpreter. The interpreter is responsible for executing the virtual instructions. It often contains distinctive components:

a VM entry responsible for context switching and initializing the virtual CPU;
a VM exit, also for context switching;
a dispatcher which retrieves the virtual instruction at the virtual program counter (VPC), then sends the control flow to the appropriate handler for the given virtual instruction;
handlers, which encode the semantics of the different virtual instructions.

Virtual instructions can be anything ranging from logical operations (AND, XOR), control flow operations (JUMP, BRANCH) to more complex operations (see Schloegel et. al. Loki: Hardening Code Obfuscation Against Automated Attacks (usenix.org)).

Devirtualization strategy

Manual analysis

If you have ever been led to manually reverse engineering a VM (a program protected with virtualization), you know that it is often a long and complex task. If you haven’t, you can check out our other post kaleidoscope and give it a try yourself!

The intuitive approach to devirtualization is usually to identify the VM context (e.g. where are located the virtual program counter and other virtual registers), reverse engineer each handler, and then create a custom disassembler for this VM. We can then disassemble the virtual instructions and reverse engineer the original program.

This is extremely tedious, mainly because the architecture used in the VM may change in each obfuscated program. Moreover, virtualized binaries can be further obfuscated, making it hard to identify and reverse engineer the handlers. For example, a common hardening strategy is to encrypt the virtual instructions.

Automated analysis

A different approach put forward by Yadegari et al. in their paper A Generic Approach to Automatic Deobfuscation of Executable Code (ieee.org) consists in using a dynamic taint analysis to reconstruct the original program’s CFG. Jonathan Salwan et al. expanded this strategy in Symbolic deobfuscation: from virtualized code back to the original (shell-storm.org) by then leveraging symbolic execution and compiler optimizations to recreate a devirtualized program.

Our devirtualization approach was heavily inspired by these taint-based methods. Given the limited timeframe of the internship, we slightly simplified this approach. We also wanted to experiment with removing the symbolic execution component which is notoriously slow.

For our devirtualization, we perform a dynamic taint analysis on the obfuscated function, tainting its parameters. We then split an execution trace of the obfuscated binary each time we encounter a tainted conditional instruction in the CFG (branches, cmove, etc.). Finally, we combine these blocks to reconstruct a CFG.

This approach is not the most robust, but worked well for the Tigress obfuscated binaries we tested against (this will be further detailed in the Results section).

Example

The best way of explaining our approach is through an example.

Let’s consider we have a simple program:

int counter(int input) {
    int i = 0;
    while (i < input) {
        i += 1;
    }
    return i;
}

Its CFG would look like this:

A virtualized version of this program could look like this:

The main elements of a VM are present in the above diagram. Here, the virtual instructions (“opcodes”) are simply the handlers’ colors.

Note that the process of obfuscating this program added additional basic blocks (the hatched ones). These include operations that modify the VPC and the dispatcher.

Taint

Performing a taint analysis on the above obfuscated program reveals that the red block (the branching instruction) is the only tainted instruction (it is the only one which interacts with the user input).

This is a very simplified view: in a more complex program, the taint would get propagated during the execution.

Execution trace

We then collect an execution trace of the obfuscated program with a random input (in this example we picked 2):

These instructions are all consecutive, the arrows are simply to make the trace fit on screen.

Note that the trace depends on the input. Had we picked input -1, the trace would not contain the green block.

Splitting the trace

At this stage, we can split the trace on the tainted CFG instructions (which we determined was the red block).

After splitting, our blocks look like this:

As you can see, the second and the third groups of blocks are identical. We could therefore merge them. Using various heuristics detailed in the complete report, we are able to merge and split blocks to create this final CFG:

Simplifying the CFG

At this point, the CFG is still quite complex with many useless instructions. The next step in the deobfuscation process consists in simplifying the code using compiler optimizations.

Using LLVM

As we just saw, taint-based devirtualization relies on compiler optimizations to simplify the recreated CFG and deobfuscate the program. We decided to use LLVM’s state-of-the-art optimizations.

Instead of simply integrating LLVM at the very end of our deobfuscation pipeline, we decided to perform all our analysis in LLVM IR. We were inspired by another deobfuscator: Saturn, which uses LLVM IR as an intermediate representation to deobfuscate opaque predicates.

First and foremost, this allows us to perform optimizations at the very beginning of our deobfuscation pipeline. This initial round of optimizations allows us to remove simple obfuscation passes allowing further analysis to run faster.

Another important consideration, is that using an IR we are able to easily create a multi-architecture tool. As of writing this article, we were able to devirtualize both amd64 and aarch64 binaries. We are in the process of adding support for their 32-bit counterparts.

Last but not least, using LLVM’s IR gives us access to a whole suite of tools. LLVM’s ecosystem not only includes LLVM’s analysis and optimization passes, but also third party tools such as the symbolic execution engine KLEE. Notably, this also works the other way around: we will be able to contribute to LLVM’s ecosystem with our analysis modules. For example, during this internship, we were led to create a dynamic taint analysis engine for LLVM IR (version 17).

Lifting

An important step is lifting: the process of translating machine instructions to another representation (in this case, LLVM IR).

To do so, we used Trail of Bits’ remill library. Using remill proved to be tricky, but we received some helpful advice from the authors of Saturn detailing not only strategies to lift complete CFGs but also custom LLVM passes and tricks to simplify the generated LLVM IR.

Lifting binaries was a large and complex part of the internship.

Results

During this internship, we were able to use our devirtualization tool on binaries (both amd64 and aarch64) obfuscated with the Tigress obfuscator.

Notably, we were able to partially deobfuscate the first Tigress challenge, and so very quickly.

For example, on our machine (OS: Linux - Fedora 40, CPU: 6-core Intel Core i5-8500, Mem: 15.44 GiB), we partially deobfuscated challenge0000-binary1 in under a second. For reference, Triton claims to fully deobfuscate the same binary in 9.20s.

Here is the obfuscated amd64 challenge0000-binary1:

And here is the deobfuscated LLVM IR partial result (132 lines of LLVM IR):

Partially devirtualized SECRET function from the obfuscated binary: challenge0000-binary1

And here is a C extract after compiling the result with llc then decompiling it with IDA:

v16_0 = (input & 0x222C2AFC) - 0x14582014;
v16_1 = input + 0x1DF2339F * v16_0 + 0x22D2B77C;
V16_2 = (input & 0x140538E4) - 0x5F1E4CE7;
v16_3 =
input + (v16_1 >> ((((unsigned __int8)((input & 0xE4) + 0x19) >> 3) & 7u) - 1) >> 1);
if ((v16_0 ^ v16_1) == V16_2 * v16_3) {
    _mermaid_missing_block();
    JUMPOUT(0x2B2LL);
}

In the C extract, you can see a call to a missing block because we did not visit that execution path. Had we chosen a different input, we could have explored this branch (but not another). To achieve better coverage, we could use symbolic execution to find inputs allowing us to visit various execution paths.

Limitations

Although we achieved promising results regarding speed and modularity, the scope of this internship was still quite limited:

we only considered deobfuscation of pure functions with no calls;
we only deobfuscate a single execution path;
our deobfuscator struggles with certain loops.

Complete report

If you found this work interesting, we have decided to make the full internship report publically available here. The report contains additional technical details, a discussion on limitations as well as our future plans.

I am continuing with a PhD at Thales, so stay tuned for further developments. If you have any questions feel free to reach out!

Potency is the measure of how much obfuscators render code unintelligible. ↩︎