1 Comment

Decompiling SCI – part 2 – control flow

Here’s a link to part 1 of this series.

Now we have a directed graph that consists of nodes of branch-less instruction sequences in a function (since the branches became the edges of the graph). This is what is needed to perform control flow analysis.

Resources

Here are some resources (generally white papers, course notes or wikipedia pages) that were useful for implementing control flow analysis.

 

Overview

Given the “raw” graph of instruction sequences, we proceed to reduce this graph by iteratively incorporating more semantic meaning to parts of it.

For instance, if we recognize a group of nodes in the graph as a loop, we can replace those nodes with a single loop node that contains information indicating which child nodes represent the head, latch and looping condition of the loop.

The basic control structures we look for are: loops, switch statements, compound conditions and if statements.

When detecting a particular type of control structure, we generally proceed from innermost structures to outermost. There are some possible failure points here, for instance when two control structures overlap and one is not completely contained within another. In that case we have no way to reduce the control flow graph, and we fail the decompilation. However, as far as I know this never occurs in actual game code.

Loops

The first control flow structures we look for are loops. To do so, we must calculate the dominators for of each node of the graph. Node A dominates another node B when all paths to node B must first pass through node A. Node A is an immediate dominator of another node B when node A dominates node B, but does not dominate any other node that dominates node B. These concepts make formal reasoning over the control flow graph more straightforward.

We use an algorithm by Tarjan and Lengauer to calculate the dominators (a link to a paper on this is listed in the Resources above).

With the dominator information, we can identify back edges in the graph (these are the edges where the tail of the loop jumps back to the beginning of the loop). This happens when a node (which ends up being the head of the loop) dominates its predecessor (which will end up being the tail of the loop).

Consider the following method (participle::doVerb from King’s Quest 6):

	(method (doVerb theVerb &tmp temp0 temp1)
		(super doVerb: theVerb &rest)
		(if (not (proc999_5 theVerb 1 94 28 13 12))
			(= temp0 0)
			(while (< temp0 20)
				(= temp1 0)
				(while (< temp1 7000)
					(++ temp1)
				)
				(DrawCel
					970
					5
					(Random 0 (- (NumCels self) 1))
					nsLeft
					nsTop
					15
				)
				(++ temp0)
			)
		)
		(DrawCel 970 5 0 nsLeft nsTop 15)
	)

Here’s the control flow graph we generate for it (I’ve replaced the assembly instructions with the final decompiled constructs for clarity):

 

Back edges are highlighted in red. When you have a node A that has a predecessor B, and all paths to that predecessor B must go through node A, then this is a loop back edge. The back edge goes from the loop latch to the loop header.

Back edges are highlighted in red, and the resulting loops are highlighted in blue and cyan (one contained within the other). When you have a node A that has a predecessor B, and all paths to that predecessor B must go through node A, then this is a loop back edge. The back edge goes from the loop latch to the loop header. We can then follow predecessor paths from the latch back to the head in order to determine which nodes are children of the loop structure.

 

Once we’ve identified a back edge, we then need to find all the child nodes of the loop (the head, latch, and the other nodes that comprise the body of the loop). We can follow the tail node back to the head node to collect all these children.

There is one additional issue here though. Loops can contain break statements. In our situation, these will be nodes whose successors are outside the loop. So following the tail node back to the head node will not pick up these nodes. Instead, we determine the follow node of the loop. This is the node just outside the loop which follows the loop. It’s where breaks will jump to. This isn’t trivial, and this is one of the places where we currently break from graph theory a bit and take advantage of the fact that SCI byte code always proceeds forward in terms of memory addresses. This is also an occasional failure point of the current decompiler.

Once the follow node for the loop has been determined, we can then follow its predecessors, and include any that are dominated by the loop header.

Switch statements

The next control structures we attempt to find are switch statements. Switch statements have a distinct instruction pattern in SCI.

A copy of the value of the switch expression is kept on the stack, and this value is duplicated onto the stack again (using the DUP opcode) for each case. Finally, the TOSS opcode is issued to pop the value from the stack at the exit of the switch statement.

The TOSS opcode is the most distinctive thing here – switch statements are the only place this is used in SCI. So the first step is to find a node that starts with a TOSS.

Consider the following method:

	(method (edgeToRoom theEdgeHit)
		(switch theEdgeHit
			(EDGE_TOP north)
			(EDGE_RIGHT east)
			(EDGE_BOTTOM south)
			(EDGE_LEFT west)
		)
	)

Which is represented by the following disassembly:

    (method (edgeToRoom) // method_0f05
  0f05:8f 01              lsp param1      // the switch value
  0f07:3c                 dup             // we dupe it on the stack for each case
  0f08:35 01              ldi 1           // the case value
  0f0a:1a                 eq? 
  0f0b:31 04              bnt code_0f11   // branch to next case
  0f0d:63 2a             pToa north 
  0f0f:33 1c              jmp code_0f2d   // jump to switch exit

        code_0f11
  0f11:3c                 dup             // duping the switch value for the next case
  0f12:35 02              ldi 2           // the case value
  0f14:1a                 eq? 
  0f15:31 04              bnt code_0f1b   // branch to next case, and so on...
  0f17:63 2c             pToa east 
  0f19:33 12              jmp code_0f2d 

        code_0f1b
  0f1b:3c                 dup 
  0f1c:35 03              ldi 3 
  0f1e:1a                 eq? 
  0f1f:31 04              bnt code_0f25 
  0f21:63 2e             pToa south 
  0f23:33 08              jmp code_0f2d 

        code_0f25
  0f25:3c                 dup 
  0f26:35 04              ldi 4 
  0f28:1a                 eq? 
  0f29:31 02              bnt code_0f2d 
  0f2b:63 30             pToa west 

        code_0f2d
  0f2d:3a                toss             // Throw away the switch value on the stack
  0f2e:48                 ret             // since it's no longer needed.
    )

 

This is represented by the following control flow graph:

 

SwitchRaw

 

When we find the node that starts with a TOSS, we can enumerate all its predecessors until we find one that dominates it (i.e. all paths that reach the TOSS node go through that node). The predecessor of that node will be the head node of the switch statement.

Once we have the collection of nodes that comprise the switch statement, we can then proceed to determine the case nodes. We assume the cases follow a pattern like so:

  • a DUP instruction
  • [some expression that puts a value into the accumulator]
  • an EQ? instruction (comparing the switch value to the case expression)
  • a BNT instruction to the next case

Proceeding forward from the switch head, and following the “else” paths of each branch node, we can determine the beginning nodes of each case statement. To find the ends, we can look for predecessors of the TOSS node that are dominated by the case heads.

The final node structure for the function above might look like this:

 

SwitchFinal

 

Compound conditions

Next, we look for patterns that represent compound conditions (e.g. if (A and B)). In the Cifuentes paper, it is shown that shortcircuited compound conditions can be reduced to one of four different patterns that represent the following cases:

  • A or B
  • A and B
  • (not A) or B
  • (not A) and B

In assembly, these would correspond to the 4 variations of the following pattern:

		expression A
		bt/bnt then/else
		expression B
		bnt else	

The Cifuentes paper uses the following diagram to list these out and show the 4 branching possibilities:

 

CC

 

More complex boolean expressions end up just being nested compound conditions, so we repeat this pattern-finding iteratively (each time grouping the relevant nodes under a “compound condition” semantic node) until no more compound conditions are found.

Some fixup

As this point, we do some fixing up of our control flow graph. Of concern are the patterns for break and continue statements in loops. These break the natural flow of the program and will interfere with our if statement detection.

Fundamentally, an if statement pattern is defined by a header node with two branches, where each branch has a common follow node that is dominated by the header node (i.e., all paths to the follow node pass through the header node). The basic control flow for if-then and if-then-else are shown below:

 

IfStatementTypes

 

If the if-statement body contains a break or a continue, then the recognizable pattern won’t be preserved. Consider the following loop:

	(while temp0
		(Prints "A")
		(-- temp0)
		(if (== temp0 4)
			(break)
		)
		(Prints "B")
	)

 

Though it contains an if-statement, we don’t see the if-statement pattern we want in the control flow graph that we’ve constructed for its disassembly:

 

One branch of the if-break node leads to the loop exit. The if-statement doesn't have an easily recognizable pattern.

One branch of the if-break node leads to the loop exit. The if-statement doesn’t have an easily recognizable pattern.

 

So we look for jumps to the loop exits (or to the loop header), and replace them with “semantic” break nodes (or continue nodes) and make the subsequent code be their successor node:

 

A semantic break node has been inserted, and we now have a recognizable if-then pattern.

A semantic break node has been inserted, and we now have a recognizable if-then pattern (if you’ll pardon the .svg renderer shifting the graph nodes around a bit).

 

 

If statements

If-statements are the last control structures we determine. We do them last because other control structures like loops and switch statements might have patterns within them that we would incorrectly detect as if-statements. By doing them last, we ensure that we correctly identify those other control structures.

The if-then and if-then-else patterns were described in the previous section. With those recognized, the previous example becomes something like this:

The If statement has been recognized and substituted in the loop body.

The If statement has been recognized and substituted in the loop body.

 

Summary…

After this major step of control flow analysis is done, we end up with a much more semantically meaningful graph of the function. It consists of the following types of nodes:

  • raw code (sequences of instructions)
  • compound conditions
  • break/continue nodes
  • loops
  • switches/cases
  • ifs

During processing, each of these nodes has had some of its child nodes semantically tagged. For instance, compound conditions have one of their children marked as “first”, and the other as “second”. Ifs have one of their children tagged as “then”, and possibly another as “else” (to indicate which nodes begin the bodies of the then and else clauses), and another marked to indicate it is the condition node.

At this point we have a good idea what the final structure of the function source code will look like, but we’re still far from done.

The control flow graph for the makeARock method on the CliffRoom object in King Quest 6's script 21.

The complete control flow graph for the makeARock method on the CliffRoom object in King Quest 6’s script 21.

 

The above graph ended up getting decompiled into the following source code:

(method (makeARock)
	(if (== rockCount maxRocks)
		(if (== gArrayScriptGetValue 300)
			(gKq6Messager say: 8 5 18 2 0 21)
		)
		((global2 script?) cue:)
	else
		(ShakeScreen 1 (Random 0 2))
		(global104 number: 300 setLoop: 1 play:)
		(if (> rockCount 0)
			([newRockStep (- rockCount 1)] stopUpd:)
		)
		((= [newRockStep rockCount] (RockStep new:))
			x:
				(cond 
					((== gArrayScriptGetValue 300) [local278 rockCount])
					(flipRocks (- 320 [local246 rockCount]))
					(else [local246 rockCount])
				)
			y:
				(if (== gArrayScriptGetValue 300)
					[local287 rockCount]
				else
					[local262 rockCount]
				)
			cel:
				(if
					(or
						(== global12 130)
						(== global12 340)
						(== global12 370)
					)
					2
				else
					0
				)
			rockPointer: rockCount
			corner:
				(cond 
					(
					(and (== rockCount 3) (== gArrayScriptGetValue 300)) 1)
					(
						(and
							(or (== rockCount 6) (== rockCount 11))
							(== gArrayScriptGetValue 320)
						)
						1
					)
					(else 0)
				)
			capStone:
				(if
					(or
						(and (== rockCount 14) (== gArrayScriptGetValue 320))
						(and (== rockCount 7) (== gArrayScriptGetValue 300))
					)
					1
				else
					0
				)
			init:
			setCycle: End RockStep
		)
		(self rockCount: (+ (self rockCount?) 1))
	)
)

The next step we’ll look at in the decompilation process is Instruction Consumption.

Advertisements

One comment on “Decompiling SCI – part 2 – control flow

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Space Quest Historian

Adventure game blogs, Let's Plays, live streams, and more

Harebrained Schemes

Developer's blog for IceFall Games

kosmonaut's blog

3d GFX and more

Halogenica

Turn up the rez!

bitsquid: development blog

Developer's blog for IceFall Games

Game Development by Sean

Developer's blog for IceFall Games

Lost Garden

Developer's blog for IceFall Games

Memories

Developer's blog for IceFall Games

Casey's Blog

Developer's blog for IceFall Games

Blog

Developer's blog for IceFall Games

Rendering Evolution

Developer's blog for IceFall Games

Simon schreibt.

Developer's blog for IceFall Games

Dev & Techno-phage

Do Computers Dream of Electric Developper?

- Woolfe -

Developer's blog for IceFall Games

Ferrara Fabio

Game & Application Developer, 3D Animator, Composer.

Clone of Duty: Stonehenge

First Person Shooter coming soon to the XBOX 360

Low Tide Productions

Games and other artsy stuff...

BadCorporateLogo

Just another WordPress.com site

Sipty's Writing

Take a look inside the mind of a game developer.

Jonas Kyratzes

Writer, game designer, filmmaker.

%d bloggers like this: