Friday, July 12, 2013

SIGINT CTF 2013: Task 0x90 (300 pts)

The "0x90" task was found in the "reversing" category and was only solved by three teams in the end. The task archive contained two files:

j00ru@xxx:~/sigint/0x90$ ls 
0x90.run xor.bin 

The "xor.bin" file was eight bytes long and contained uninteresting binary data, while "0x90.run" turned out to be a 64-bit statically compiled ELF file of significant size:

j00ru@xxx:~/sigint/0x90$ file 0x90.run
0x90.run: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.24, not stripped
j00ru@xxx:~/sigint/0x90$ du -hs 0x90.run
3.0M    0x90.run


Starting the program on an older machine throws the following error message:

Fatal Error: This program was not built to run on the processor in your system.
The allowed processors are: Intel(R) processors with SSE4.2 and POPCNT instructions support.


Interesting! Repeating the same action on a more recent hardware configuration doesn't seem to yield any evident results - the application successfully starts and extensively consumes CPU resources (using trigonometric functions), but nothing much happens on stdout:

(gdb) r
Starting program: /home/mjurczyk/Downloads/sigint/0x90/0x90.run 
^C
Program received signal SIGINT, Interrupt.
0x000000000040399f in atan.L ()
(gdb)

Lacking ways to interact with a running process, we decided to do some actual reverse engineering at this point. If you load the file up in IDA and take a brief look at the entry point, it is clearly visible that the executable was built with the Intel C++ Compiler (ICC):


Following a cursory analysis, we were able to establish the logic of the challenge and its actual goal. Long story short, the program stores a 64-bit hash throughout its entire lifetime, gradually forming its final value in the following manner:
hash ‹ 0xC23F3048EA749B76
if (ptrace(PTRACE_TRACEME) succeeds) {
  hash++
}
hash = merge_hashes(hash, calculate_hash(strip(argv[0]), strlen(strip(argv[0]))))
for (int i = 0; i < 1000; i++) {
  benchmark()
  hash = merge_hashes(hash, calculate_hash(image_base, image_size))
  hash += open64("/proc/self/status")
}
hash ^= xor.bin file contents

The program would then print out "sigint_" followed by the binary hash value casted to a textual form (i.e. each byte of the hash should be printable at this point, if it is valid). The exact implementations of the "merge_hashes" and "calculate_hash" functions are not relevant at this point; it is only important to note that the first ones `reduces` two 64-bit values into a single one using binary and arithmetic operations, whereas the second one calculates a 64-bit hash value given an input memory area.

In theory, obtaining the flag should be as easy as launching the executable and observing stdout. What makes it an actual challenge is the presence of the benchmark function, which further invokes one of two subroutines, depending on the CPU capabilities: benchmark_kerneldi_W and benchmark_kerneldi_A. In essence, each function were programmed to perform 10.000.000.000.000 (ten trillion) iterations of expensive SSE4.2 operations - something that would never realistically complete within the time frame of the CTF, which is where the problems begin.

There are several important conclusions we can draw here:
  1. we would like the program to complete in reasonable time, i.e. get rid of the time consuming benchmark loop.
  2. we would like the final hash to be equal to one which would be generated with the loop in place, which indicates that:
    1. the authors most likely expect us to use the original filename for the file, we should not change it.
    2. we should be careful attaching a debugger to the program because doing so might affect the output if we're not careful.
    3. also attaching a remote debugger past the ptrace() call is not possible.
    4. the program should use non-modified memory for hash computation, if we decide to make any alterations to its executable code.
    5. all calls to functions which make use of global variables (e.g. srand) are crucial and cannot be ommitted.
    6. file descriptors returned by open64 should be identical to ones returned normally (relevant to gdb, which creates additional descriptors in the target process and thus affects open64 return values).
 Considering the volume of requirements above, it is fairly troublesome to patch the program in a way that emulates the normal execution environment, but still removes the lengthy loop. While it is surely possible to develop such a patch, it is by no means elegant. The perfect solution would be to either:
  • obtain the correct return values of the "calculate_hash" function for argv[0] and program memory during each iteration, and create our own implementation of the final hash calculation, or ...
  • ... remove the loop in a way that does not require modifying the code of the loop itself, i.e. on CPU level.
Note that while changing the semantics of an instruction would typically require an x86 hardware debugger or ability to apply arbitrary microcode updates, there is a much easier way - you could use a CPU emulator, such as Bochs!

As both Gynvael and I had some prior experience with writing Bochs instrumentation (see here and here), I was happy to implement the idea. The next few minutes of development resulted in the creation of the following short code snippet:

#include <stdint.h>
#include <stdarg.h>
#include <time.h>

#include "bochs.h"
#include "cpu/cpu.h"
#include "cpu/instr.h"

#include "instrument.h"

#ifndef RAX
# define RAX pcpu->gen_reg[BX_64BIT_REG_RAX].rrx
#endif  // RAX

#ifndef RBX
# define RBX pcpu->gen_reg[BX_64BIT_REG_RBX].rrx
#endif  // RBX

#ifndef RIP
# define RIP pcpu->prev_rip
#endif  // RIP

void bx_instr_before_execution(unsigned cpu, bxInstruction_c *i) {
  static unsigned int adjustements = 0;

  BX_CPU_C *pcpu = BX_CPU(cpu);
  if (!pcpu->protected_mode()) {
    return;
  }

  if (RAX == 10000000000000LL) {
    RAX = 2;
    fprintf(stderr, "[sigint_0x90] {%u} Special RAX found and adjusted at RIP=%llx, %u\n",
            time(NULL), RIP, ++adjustements);
    fflush(stderr);
  } else if (RIP == 0x402669 && (RBX & 0xffffffff00000000LL)) {
    fprintf(stderr, "[sigint_0x90] {%u} Hash value: %llx\n", time(NULL), RBX);
    fflush(stderr);
  } else if (RIP == 0x4026e9 && RAX == RBX && RAX < 0x10000) {
    fprintf(stderr, "[sigint_0x90] {%u} open64() fd: %llx\n", time(NULL), RAX);
    fflush(stderr);
  }
}

The code would serve three different purposes - nullifying the benchmark loop and displaying information about the static image hash value and open64 syscall return value for each of 1000 external loop iterations. After building Bochs, booting up an Ubuntu 13.04 Server 64-bit guest (we happened to have a Bochs hdd image handy due to unrelated bochspwn project activity) and starting the 0x90.run executable, we could observe the following emulator console output:
While the executable in the guest system was running at around one iteration of the external loop per second, I reverse engineered and rewrote the merge_hashes, calculate_hash and final hash generation code to C++. Once I found that the static image hash is 0x79082a819dc08d7f for every loop iteration (i.e. no static memory of the program changes between) and the open64 numeric file descriptor values start at 4 and increment by one, I ended up with the following implementation:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <stdint.h>
using namespace std;

uint64_t hash_region(const char *data, uint32_t length) {
  uint64_t h = 0;
  for (uint32_t i = 0; i < length; i++) {
    h = (h << 6) + (h << 16) - h + data[i];
  }
  return h;
}

uint64_t merge_hashes(uint64_t h, uint64_t g) {
  return ((g << 16) - g + (h << 8) + h);
}

int main() {
  const uint64_t kChallengeImageHash = 0x79082a819dc08d7f;
  const uint64_t kXorConstant = 0x6704b2e715d8d012;
  const char filename[] = "/0x90.run";

  // Initial value from: 
  //   mov     rbx, 0C23F3048EA749B76h
  uint64_t hash = 0xC23F3048EA749B76LL;

  // increment for failed ptrace(PTRACE_TRACEME); debugged process.
  // hash++;

  hash = merge_hashes(hash, hash_region(filename, strlen(filename)));

  // 1000 is a constant number of iterations:
  //   cmp     r14, 1000
  //   jb      loc_402546
  unsigned int open64_fd = 4;
  for (unsigned int i = 0; i < 1000; i++, open64_fd++) {
    hash = merge_hashes(hash, kChallengeImageHash) + open64_fd;
  }

  // Final stage: xor with the contents of xor.bin.
  hash ^= kXorConstant;

  // Display solution.
  uint8_t hash_string[12];
  memcpy(hash_string, &hash, sizeof(uint64_t));
  hash_string[8] = '\0';
  printf("hash(\"%s\") = %llx, sigint_%s\n", filename, hash, hash_string);

  return 0;
}

The output of the above code was as follows:

hash("/0x90.run") = 52336d6d6148636d, sigint_mcHamm3R

As you can imagine, "sigint_mcHamm3R" turned out to be the correct flag. +300 points. :) While writing a C++ turned out to be faster than waiting for 0x90.run to complete in Bochs, you could as well just wait for around 30 minutes and grab the flag directly from the program standard output:


No comments:

Post a Comment