3 Comments

Decompiling SCI byte code – part 1

SCI is the Sierra Creative Interpreter – the scripting language and runtime on which Sierra’s adventure games were built between roughly 1988 and 1996.

As part of the SCI Companion project (an IDE for SCI) I worked on over the past year, I wrote a decompiler. In this series of posts I’ll go over the process of generating source code from compiled byte code, and the technical challenges and limitations of doing so.

A good reference for SCI byte code is this page in the ScummVM wiki.

This post is part 1 of the series. Here are links to the rest:

Decompiler

In contrast to a disassembler, implementing a decompiler is quite a task. A disassembler is basically just assembly language with symbol lookups. That is, a direct listing of the byte code but with textual opcodes and the operand numbers replaced with variable names, literals or object names where possible.

The raw byte code for this method.

The raw byte code for the Head::showSelf method in Space Quest 1 VGA

 

HeadShowSelf

Disassembly from SV (Sierra Viewer) for the Head::showSelf method in Space Quest 1 VGA. With some patience it’s possible to understand what this method is doing, but it’s very slow-going.

 

A decompiler, on the other hand, attempts to recover the higher level code structure: if statements, loops, switch statements and so on. In addition, some work is done to generate meaningful variable names where possible.

 

SourceCode

SCI Companion’s decompiled source code generated for the Head::showSelf method.

 

Compared to a more generic x86 decompiler, say, decompiling SCI byte code is easier. SCI byte code and script resources actually have a decent amount of “original information” in them. It is probably comparable to that of CIL (what C# and other .net languages compile to), and certainly much more than compiled native x86 code, for instance.

We can retrieve the following information from the compiled script resources:

  • class and instance names
  • method names (both for calls and implementations)
  • property names

Some things whose original names are not retrievable are:

  • temporary variables, script and global variables, function parameters
  • procedures (i.e. functions that exist outside an object)

For many of these items we can make reasonable guesses though.

SCI byte code also has the advantage of being generated by a single compiler (at least for the original Sierra games – not for fan games). This means patterns of instructions representing a particular higher level construct will generally be similar. For instance, there is a very distinct pattern for switch statements.

Start to finish

The starting point is the SCI script resources. These contain code logic, class and instance definitions, strings, and script variables. Different versions of SCI use different formats for script resources, but my resource loader loads each into a single runtime format so it can be interacted with in a consistent way.

The ending point is an abstract syntax tree describing the entire script file. This is a tree of code constructs like if statements, loops, and function calls. From here, I can run the AST through a syntax formatter that outputs the source code in one of two (currently) source code languages (I also had a c++ style source code I was working on for a while, but I abandoned that).

 

Determining code boundaries

The first step is to determine the boundaries of a function. SCI script resources can have public procedures and objects with methods. In each of these cases, an offset to the beginning of the method is provided. There is no end point or code size given, other than the total length of the entire script resource. We can try to determine end points by keeping track of where other chunks of data in the script resource begin, but that method isn’t always reliable.

Instead, to determine where a function ends, we can look for a RET opcode (return), which returns control to the calling function. However, RET can appear in the middle of a function too. So we actually need to look at all the branching instructions to keep track of the “furthest” point to which we have branched. If we encounter a RET instruction that is after any branch target, then we know we’ve reached the end of the function. Some corrupt scripts don’t satisfy this condition, and so this is a possible point of failure for decompilation.

At this point, it might be worth pointing out that SCI’s limited scope and having knowledge of the compiler originally used helps us out. For instance, a generic x86 decompiler would be a much more difficult task, since function code may not be laid out in sequence in memory. However, for SCI we know this to be true. We’ll never encounter a function that jumps to another arbitrary place in memory, so we can always assume that if there are no branch targets beyond a RET instruction we’ve encountered, that it must be the end of the function.

 

Determining the boundary for the Collect::eachElementDo method. There's a RET in the middle of the function, but we know there is a branch instruction that points after the RET - so the first RET doesn't signal the end of the function.

Determining the boundary for the Collect::eachElementDo method. There’s a RET in the middle of the function, but we know there is a branch instruction that points after the RET – so the first RET doesn’t signal the end of the function.

 

One final challenge here is that SCI also supports local procedures in the script resources. For these (unlike public procedures and methods), we have no entry point listed anywhere in the script resource. Instead, we need to scan all bytecode for CALL opcodes, and put together a list of the call target addresses. These would be our entry points for local procedures.

After all this work is done, we have a set of start/end points for the code segments (functions) that we need to decompile.

Partitioning code into blocks

In order to perform control flow analysis (the most difficult step of decompilation), we need to distill the instructions down to some fundamental nodes that we can use in our analysis.

The interesting boundaries for control flow are the branch instructions (bnt, bt and jmp) within the code segments. Boundaries are also defined by the targets of branch instructions.

Let’s take a look at this piece of script:

;;; Sierra Script 1.0 - (do not remove this comment)
(script# 19)

(use Main)
(use Print)

(public
	SomeProc 0
)

(procedure (SomeProc &tmp someVar anotherVar)
	(= someVar 11)
	(= anotherVar 0)
	(if (== gRoomNumber 13)
		(Prints "Blah")
		(while (> someVar 0)
			(Printf "Count: %d" someVar)
			(if (== 0 (mod anotherVar 2))
				(++ anotherVar)
			)
			(-- someVar)
		)
	)
)

 

Here’s the assembly that would correspond to this script, annotated to indicate where the boundaries are between sequences of instructions that comprise “nodes” that are interesting for control flow.

/ EXPORTED procedure #0 (SomeProc)
(procedure proc_000a
  000a:3f 02             link 2 // (var $2)
  000c:35 0b              ldi b 
  000e:a5 00              sat temp0 
  0010:35 00              ldi 0 
  0012:a5 01              sat temp1 
  0014:89 0b              lsg gRoomNumber 
  0016:35 0d              ldi d 
  0018:1a                 eq? 
  0019:31 2c              bnt code_0047 
-------------- BOUNDARY
  001b:78               push1 
  001c:74 0006          lofss $0006 // Blah
  001f:46 0399 0000 02  calle 399 procedure_0000 2 // Prints 
-------------- BOUNDARY
        code_0025
  0025:8d 00              lst temp0 
  0027:35 00              ldi 0 
  0029:1e                 gt? 
  002a:31 1b              bnt code_0047 
-------------- BOUNDARY
  002c:7a               push2 
  002d:74 000b          lofss $000b // Count: %d
  0030:8d 00              lst temp0 
  0032:46 0399 0001 04  calle 399 procedure_0001 4 // Printf 

  0038:76               push0 
  0039:8d 01              lst temp1 
  003b:35 02              ldi 2 
  003d:0a                 mod 
  003e:1a                 eq? 
  003f:31 02              bnt code_0043 
-------------- BOUNDARY
  0041:c5 01              +at temp1 
-------------- BOUNDARY
        code_0043
  0043:e5 00              -at temp0 
  0045:33 de              jmp code_0025 
-------------- BOUNDARY
        code_0047
  0047:48                 ret 
)

 

This is easier to visualize in graph form. The nodes in the graph are marked by their address in the script resource, followed by the list of assembly instructions that are part of that node:

 

RawControlFlowGraph

 

 

You’ll note that we have created one extra node in this graph: the beginning of the set of instructions that lead up to the first if statement have also been marked as a boundary. This is a sort of tweak that will make it easier to distinguish if statements with compound conditions vs nested if statements that have code between the if statements that isn’t part of the 2nd if’s condition.

These graphs form the basis of the next step, control flow analysis.

 

Advertisements

3 comments on “Decompiling SCI byte code – part 1

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Just another WordPress site

Just another WordPress.com site

The Space Quest Historian

Adventure game blogs, Let's Plays, live streams, and more

Harebrained Schemes

Developer's blog for IceFall Games

kosmonaut's blog

3d GFX and more

Halogenica

Turn up the rez!

bitsquid: development blog

Developer's blog for IceFall Games

Game Development by Sean

Developer's blog for IceFall Games

Lost Garden

Developer's blog for IceFall Games

Memories

Developer's blog for IceFall Games

Casey's Blog

Developer's blog for IceFall Games

Blog

Developer's blog for IceFall Games

Rendering Evolution

Developer's blog for IceFall Games

Simon schreibt.

Developer's blog for IceFall Games

Dev & Techno-phage

Do Computers Dream of Electric Developper?

- Woolfe -

Developer's blog for IceFall Games

Fabio Ferrara

Game Developer

Clone of Duty: Stonehenge

First Person Shooter coming soon to the XBOX 360

Low Tide Productions

Games and other artsy stuff...

BadCorporateLogo

Just another WordPress.com site

Sipty's Writing

Take a look inside the mind of a game developer.

%d bloggers like this: