Here’s a link to part 1 of this series.
Now we have a directed graph that consists of nodes of branch-less instruction sequences in a function (since the branches became the edges of the graph). This is what is needed to perform control flow analysis.
Here are some resources (generally white papers, course notes or wikipedia pages) that were useful for implementing control flow analysis.
Given the “raw” graph of instruction sequences, we proceed to reduce this graph by iteratively incorporating more semantic meaning to parts of it.
For instance, if we recognize a group of nodes in the graph as a loop, we can replace those nodes with a single loop node that contains information indicating which child nodes represent the head, latch and looping condition of the loop.
The basic control structures we look for are: loops, switch statements, compound conditions and if statements.
When detecting a particular type of control structure, we generally proceed from innermost structures to outermost. There are some possible failure points here, for instance when two control structures overlap and one is not completely contained within another. In that case we have no way to reduce the control flow graph, and we fail the decompilation. However, as far as I know this never occurs in actual game code.
The first control flow structures we look for are loops. To do so, we must calculate the dominators for of each node of the graph. Node A dominates another node B when all paths to node B must first pass through node A. Node A is an immediate dominator of another node B when node A dominates node B, but does not dominate any other node that dominates node B. These concepts make formal reasoning over the control flow graph more straightforward.
We use an algorithm by Tarjan and Lengauer to calculate the dominators (a link to a paper on this is listed in the Resources above).
With the dominator information, we can identify back edges in the graph (these are the edges where the tail of the loop jumps back to the beginning of the loop). This happens when a node (which ends up being the head of the loop) dominates its predecessor (which will end up being the tail of the loop).
Consider the following method (participle::doVerb from King’s Quest 6):
(method (doVerb theVerb &tmp temp0 temp1) (super doVerb: theVerb &rest) (if (not (proc999_5 theVerb 1 94 28 13 12)) (= temp0 0) (while (< temp0 20) (= temp1 0) (while (< temp1 7000) (++ temp1) ) (DrawCel 970 5 (Random 0 (- (NumCels self) 1)) nsLeft nsTop 15 ) (++ temp0) ) ) (DrawCel 970 5 0 nsLeft nsTop 15) )
Here’s the control flow graph we generate for it (I’ve replaced the assembly instructions with the final decompiled constructs for clarity):
Once we’ve identified a back edge, we then need to find all the child nodes of the loop (the head, latch, and the other nodes that comprise the body of the loop). We can follow the tail node back to the head node to collect all these children.
There is one additional issue here though. Loops can contain break statements. In our situation, these will be nodes whose successors are outside the loop. So following the tail node back to the head node will not pick up these nodes. Instead, we determine the follow node of the loop. This is the node just outside the loop which follows the loop. It’s where breaks will jump to. This isn’t trivial, and this is one of the places where we currently break from graph theory a bit and take advantage of the fact that SCI byte code always proceeds forward in terms of memory addresses. This is also an occasional failure point of the current decompiler.
Once the follow node for the loop has been determined, we can then follow its predecessors, and include any that are dominated by the loop header.
The next control structures we attempt to find are switch statements. Switch statements have a distinct instruction pattern in SCI.
A copy of the value of the switch expression is kept on the stack, and this value is duplicated onto the stack again (using the DUP opcode) for each case. Finally, the TOSS opcode is issued to pop the value from the stack at the exit of the switch statement.
The TOSS opcode is the most distinctive thing here – switch statements are the only place this is used in SCI. So the first step is to find a node that starts with a TOSS.
Consider the following method:
(method (edgeToRoom theEdgeHit) (switch theEdgeHit (EDGE_TOP north) (EDGE_RIGHT east) (EDGE_BOTTOM south) (EDGE_LEFT west) ) )
Which is represented by the following disassembly:
(method (edgeToRoom) // method_0f05 0f05:8f 01 lsp param1 // the switch value 0f07:3c dup // we dupe it on the stack for each case 0f08:35 01 ldi 1 // the case value 0f0a:1a eq? 0f0b:31 04 bnt code_0f11 // branch to next case 0f0d:63 2a pToa north 0f0f:33 1c jmp code_0f2d // jump to switch exit code_0f11 0f11:3c dup // duping the switch value for the next case 0f12:35 02 ldi 2 // the case value 0f14:1a eq? 0f15:31 04 bnt code_0f1b // branch to next case, and so on... 0f17:63 2c pToa east 0f19:33 12 jmp code_0f2d code_0f1b 0f1b:3c dup 0f1c:35 03 ldi 3 0f1e:1a eq? 0f1f:31 04 bnt code_0f25 0f21:63 2e pToa south 0f23:33 08 jmp code_0f2d code_0f25 0f25:3c dup 0f26:35 04 ldi 4 0f28:1a eq? 0f29:31 02 bnt code_0f2d 0f2b:63 30 pToa west code_0f2d 0f2d:3a toss // Throw away the switch value on the stack 0f2e:48 ret // since it's no longer needed. )
This is represented by the following control flow graph:
When we find the node that starts with a TOSS, we can enumerate all its predecessors until we find one that dominates it (i.e. all paths that reach the TOSS node go through that node). The predecessor of that node will be the head node of the switch statement.
Once we have the collection of nodes that comprise the switch statement, we can then proceed to determine the case nodes. We assume the cases follow a pattern like so:
- a DUP instruction
- [some expression that puts a value into the accumulator]
- an EQ? instruction (comparing the switch value to the case expression)
- a BNT instruction to the next case
Proceeding forward from the switch head, and following the “else” paths of each branch node, we can determine the beginning nodes of each case statement. To find the ends, we can look for predecessors of the TOSS node that are dominated by the case heads.
The final node structure for the function above might look like this:
Next, we look for patterns that represent compound conditions (e.g. if (A and B)). In the Cifuentes paper, it is shown that shortcircuited compound conditions can be reduced to one of four different patterns that represent the following cases:
- A or B
- A and B
- (not A) or B
- (not A) and B
In assembly, these would correspond to the 4 variations of the following pattern:
expression A bt/bnt then/else expression B bnt else
The Cifuentes paper uses the following diagram to list these out and show the 4 branching possibilities:
More complex boolean expressions end up just being nested compound conditions, so we repeat this pattern-finding iteratively (each time grouping the relevant nodes under a “compound condition” semantic node) until no more compound conditions are found.
As this point, we do some fixing up of our control flow graph. Of concern are the patterns for break and continue statements in loops. These break the natural flow of the program and will interfere with our if statement detection.
Fundamentally, an if statement pattern is defined by a header node with two branches, where each branch has a common follow node that is dominated by the header node (i.e., all paths to the follow node pass through the header node). The basic control flow for if-then and if-then-else are shown below:
If the if-statement body contains a break or a continue, then the recognizable pattern won’t be preserved. Consider the following loop:
(while temp0 (Prints "A") (-- temp0) (if (== temp0 4) (break) ) (Prints "B") )
Though it contains an if-statement, we don’t see the if-statement pattern we want in the control flow graph that we’ve constructed for its disassembly:
So we look for jumps to the loop exits (or to the loop header), and replace them with “semantic” break nodes (or continue nodes) and make the subsequent code be their successor node:
If-statements are the last control structures we determine. We do them last because other control structures like loops and switch statements might have patterns within them that we would incorrectly detect as if-statements. By doing them last, we ensure that we correctly identify those other control structures.
The if-then and if-then-else patterns were described in the previous section. With those recognized, the previous example becomes something like this:
After this major step of control flow analysis is done, we end up with a much more semantically meaningful graph of the function. It consists of the following types of nodes:
- raw code (sequences of instructions)
- compound conditions
- break/continue nodes
During processing, each of these nodes has had some of its child nodes semantically tagged. For instance, compound conditions have one of their children marked as “first”, and the other as “second”. Ifs have one of their children tagged as “then”, and possibly another as “else” (to indicate which nodes begin the bodies of the then and else clauses), and another marked to indicate it is the condition node.
At this point we have a good idea what the final structure of the function source code will look like, but we’re still far from done.
The above graph ended up getting decompiled into the following source code:
(method (makeARock) (if (== rockCount maxRocks) (if (== gArrayScriptGetValue 300) (gKq6Messager say: 8 5 18 2 0 21) ) ((global2 script?) cue:) else (ShakeScreen 1 (Random 0 2)) (global104 number: 300 setLoop: 1 play:) (if (> rockCount 0) ([newRockStep (- rockCount 1)] stopUpd:) ) ((= [newRockStep rockCount] (RockStep new:)) x: (cond ((== gArrayScriptGetValue 300) [local278 rockCount]) (flipRocks (- 320 [local246 rockCount])) (else [local246 rockCount]) ) y: (if (== gArrayScriptGetValue 300) [local287 rockCount] else [local262 rockCount] ) cel: (if (or (== global12 130) (== global12 340) (== global12 370) ) 2 else 0 ) rockPointer: rockCount corner: (cond ( (and (== rockCount 3) (== gArrayScriptGetValue 300)) 1) ( (and (or (== rockCount 6) (== rockCount 11)) (== gArrayScriptGetValue 320) ) 1 ) (else 0) ) capStone: (if (or (and (== rockCount 14) (== gArrayScriptGetValue 320)) (and (== rockCount 7) (== gArrayScriptGetValue 300)) ) 1 else 0 ) init: setCycle: End RockStep ) (self rockCount: (+ (self rockCount?) 1)) ) )
The next step we’ll look at in the decompilation process is Instruction Consumption.