SCI is the Sierra Creative Interpreter – the scripting language and runtime on which Sierra’s adventure games were built between roughly 1988 and 1996.
As part of the SCI Companion project (an IDE for SCI) I worked on over the past year, I wrote a decompiler. In this series of posts I’ll go over the process of generating source code from compiled byte code, and the technical challenges and limitations of doing so.
A good reference for SCI byte code is this page in the ScummVM wiki.
This post is part 1 of the series. Here are links to the rest:
In contrast to a disassembler, implementing a decompiler is quite a task. A disassembler is basically just assembly language with symbol lookups. That is, a direct listing of the byte code but with textual opcodes and the operand numbers replaced with variable names, literals or object names where possible.
A decompiler, on the other hand, attempts to recover the higher level code structure: if statements, loops, switch statements and so on. In addition, some work is done to generate meaningful variable names where possible.
Compared to a more generic x86 decompiler, say, decompiling SCI byte code is easier. SCI byte code and script resources actually have a decent amount of “original information” in them. It is probably comparable to that of CIL (what C# and other .net languages compile to), and certainly much more than compiled native x86 code, for instance.
We can retrieve the following information from the compiled script resources:
- class and instance names
- method names (both for calls and implementations)
- property names
Some things whose original names are not retrievable are:
- temporary variables, script and global variables, function parameters
- procedures (i.e. functions that exist outside an object)
For many of these items we can make reasonable guesses though.
SCI byte code also has the advantage of being generated by a single compiler (at least for the original Sierra games – not for fan games). This means patterns of instructions representing a particular higher level construct will generally be similar. For instance, there is a very distinct pattern for switch statements.
Start to finish
The starting point is the SCI script resources. These contain code logic, class and instance definitions, strings, and script variables. Different versions of SCI use different formats for script resources, but my resource loader loads each into a single runtime format so it can be interacted with in a consistent way.
The ending point is an abstract syntax tree describing the entire script file. This is a tree of code constructs like if statements, loops, and function calls. From here, I can run the AST through a syntax formatter that outputs the source code in one of two (currently) source code languages (I also had a c++ style source code I was working on for a while, but I abandoned that).
Determining code boundaries
The first step is to determine the boundaries of a function. SCI script resources can have public procedures and objects with methods. In each of these cases, an offset to the beginning of the method is provided. There is no end point or code size given, other than the total length of the entire script resource. We can try to determine end points by keeping track of where other chunks of data in the script resource begin, but that method isn’t always reliable.
Instead, to determine where a function ends, we can look for a RET opcode (return), which returns control to the calling function. However, RET can appear in the middle of a function too. So we actually need to look at all the branching instructions to keep track of the “furthest” point to which we have branched. If we encounter a RET instruction that is after any branch target, then we know we’ve reached the end of the function. Some corrupt scripts don’t satisfy this condition, and so this is a possible point of failure for decompilation.
At this point, it might be worth pointing out that SCI’s limited scope and having knowledge of the compiler originally used helps us out. For instance, a generic x86 decompiler would be a much more difficult task, since function code may not be laid out in sequence in memory. However, for SCI we know this to be true. We’ll never encounter a function that jumps to another arbitrary place in memory, so we can always assume that if there are no branch targets beyond a RET instruction we’ve encountered, that it must be the end of the function.
One final challenge here is that SCI also supports local procedures in the script resources. For these (unlike public procedures and methods), we have no entry point listed anywhere in the script resource. Instead, we need to scan all bytecode for CALL opcodes, and put together a list of the call target addresses. These would be our entry points for local procedures.
After all this work is done, we have a set of start/end points for the code segments (functions) that we need to decompile.
Partitioning code into blocks
In order to perform control flow analysis (the most difficult step of decompilation), we need to distill the instructions down to some fundamental nodes that we can use in our analysis.
The interesting boundaries for control flow are the branch instructions (bnt, bt and jmp) within the code segments. Boundaries are also defined by the targets of branch instructions.
Let’s take a look at this piece of script:
;;; Sierra Script 1.0 - (do not remove this comment) (script# 19) (use Main) (use Print) (public SomeProc 0 ) (procedure (SomeProc &tmp someVar anotherVar) (= someVar 11) (= anotherVar 0) (if (== gRoomNumber 13) (Prints "Blah") (while (> someVar 0) (Printf "Count: %d" someVar) (if (== 0 (mod anotherVar 2)) (++ anotherVar) ) (-- someVar) ) ) )
Here’s the assembly that would correspond to this script, annotated to indicate where the boundaries are between sequences of instructions that comprise “nodes” that are interesting for control flow.
/ EXPORTED procedure #0 (SomeProc) (procedure proc_000a 000a:3f 02 link 2 // (var $2) 000c:35 0b ldi b 000e:a5 00 sat temp0 0010:35 00 ldi 0 0012:a5 01 sat temp1 0014:89 0b lsg gRoomNumber 0016:35 0d ldi d 0018:1a eq? 0019:31 2c bnt code_0047 -------------- BOUNDARY 001b:78 push1 001c:74 0006 lofss $0006 // Blah 001f:46 0399 0000 02 calle 399 procedure_0000 2 // Prints -------------- BOUNDARY code_0025 0025:8d 00 lst temp0 0027:35 00 ldi 0 0029:1e gt? 002a:31 1b bnt code_0047 -------------- BOUNDARY 002c:7a push2 002d:74 000b lofss $000b // Count: %d 0030:8d 00 lst temp0 0032:46 0399 0001 04 calle 399 procedure_0001 4 // Printf 0038:76 push0 0039:8d 01 lst temp1 003b:35 02 ldi 2 003d:0a mod 003e:1a eq? 003f:31 02 bnt code_0043 -------------- BOUNDARY 0041:c5 01 +at temp1 -------------- BOUNDARY code_0043 0043:e5 00 -at temp0 0045:33 de jmp code_0025 -------------- BOUNDARY code_0047 0047:48 ret )
This is easier to visualize in graph form. The nodes in the graph are marked by their address in the script resource, followed by the list of assembly instructions that are part of that node:
You’ll note that we have created one extra node in this graph: the beginning of the set of instructions that lead up to the first if statement have also been marked as a boundary. This is a sort of tweak that will make it easier to distinguish if statements with compound conditions vs nested if statements that have code between the if statements that isn’t part of the 2nd if’s condition.
These graphs form the basis of the next step, control flow analysis.