Here’s a link to part 1 of this series.
Another fundamental concept we need to introduce is instruction consumption.
Each SCI instruction (opcode + operands) “consumes” a certain number of values from the stack, or the value of the accumulator or prev registers. Likewise, it may “generate” a result that gets pushed onto the stack, or placed in the accumulator or prev register.
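This consume/generate bookkeeping can be expressed as a small table. Here's a minimal sketch in Python; the names (`Effect`, `EFFECTS`, `needs`) and the per-opcode values are hypothetical simplifications for illustration, not SCI Companion's actual tables:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Effect:
    pops: int = 0             # values consumed from the stack
    reads_acc: bool = False   # consumes the accumulator
    pushes: int = 0           # values pushed onto the stack
    writes_acc: bool = False  # places a result in the accumulator

# A few of the opcodes used in the examples below (simplified):
EFFECTS = {
    "lsp": Effect(pushes=1),                           # push a parameter onto the stack
    "ldi": Effect(writes_acc=True),                    # load an immediate into the accumulator
    "add": Effect(pops=1, reads_acc=True, writes_acc=True),
    "sal": Effect(reads_acc=True),                     # store the accumulator in a local
}

def needs(opcode: str) -> str:
    """Describe what an instruction consumes."""
    e = EFFECTS[opcode]
    parts = []
    if e.pops:
        parts.append(f"{e.pops} stack value(s)")
    if e.reads_acc:
        parts.append("the accumulator")
    return " and ".join(parts) or "nothing"
```

So, for example, `needs("add")` reports that it consumes a stack value and the accumulator, while `needs("lsp")` consumes nothing.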
Let’s look at a simple example. Consider the following code:
(= localVar (+ param1 3))
and its corresponding assembly instructions:
```
lsp param1
ldi 3
add
sal localVar
```
How do we go about actually constructing the former from the latter?
We can start from the last instruction (sal localVar), which means “store the value of the accumulator in this local variable”. From this, we know we’ll have an assignment operation, and the value being assigned will come from the value in the accumulator.
So going backwards, we see the previous instruction is add. This satisfies our need for an accumulator value, since add places its result in the accumulator. So the result of the add is what will end up being assigned to the local variable above.
Of course, add consumes an accumulator value itself, in addition to popping a value from the stack. So we work backwards again trying to find an instruction that pushes a value onto the stack, and the next instruction that places a value into the accumulator. Luckily, in this case, these are immediately satisfied. ldi places a value in the accumulator (3), and lsp pushes a value onto the stack (the value of param1).
Essentially, we’re building a tree of instructions, where a node represents an instruction and a node’s children represent instructions that generate the data needed for the node’s “consumption”. So in the above example, we’d have:
- sal localVar
    - add
        - ldi 3
        - lsp param1
In practice of course, these trees become quite large and nested quite deeply. SCI has some fairly complex instructions that can consume arbitrary amounts of stack values.
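The backward walk described above can be sketched in a few lines of Python. This is a minimal illustration for the straight-line case only: the opcode tables and function names are hypothetical, it ignores branches, the prev register, and the multiple-consumer cases discussed next:

```python
# opcode -> (stack values popped, reads accumulator?)
CONSUMES = {"sal": (0, True), "add": (1, True), "ldi": (0, False), "lsp": (0, False)}
# opcode -> what it generates: "acc", "stack", or None
GENERATES = {"sal": None, "add": "acc", "ldi": "acc", "lsp": "stack"}

def build_tree(instructions):
    """instructions: list of (opcode, operand). Returns a nested
    (opcode, operand, children) tuple rooted at the last instruction."""
    def consume(i, kind):
        # Scan backwards for the nearest instruction generating `kind`.
        while i >= 0 and GENERATES[instructions[i][0]] != kind:
            i -= 1
        if i < 0:
            raise ValueError(f"no earlier instruction generates {kind!r}")
        return build_node(i)

    def build_node(i):
        op, operand = instructions[i]
        pops, reads_acc = CONSUMES[op]
        children, j = [], i - 1
        if reads_acc:
            child, j = consume(j, "acc")
            children.append(child)
        for _ in range(pops):
            child, j = consume(j, "stack")
            children.append(child)
        return (op, operand, children), j

    root, _ = build_node(len(instructions) - 1)
    return root

tree = build_tree([("lsp", "param1"), ("ldi", 3), ("add", None), ("sal", "localVar")])
```

Running this on the example above yields sal at the root, add as its child, and ldi/lsp as add's children.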
Things aren’t always as clean as in the previous example.
For instance, it is not uncommon for a value in the accumulator to be “consumed” by multiple instructions.
Consider the following assembly from the spotEgo method in script 80 of King’s Quest 6:
```
pToa spotEgoScr
bnt code_01f6     // Consumes the accumulator, which is spotEgoScr.
pushi 92          // $92 setScript
pushi 3
push              // Consumes the accumulator, which is spotEgoScr.
push0
lsp param1
lag global2
send a
```
The value in the accumulator is used twice. This decompiles into the following code, which is a pretty common pattern (branch on something, then use the value that was branched on):
```
(if spotEgoScr
    (global2 setScript: spotEgoScr 0 param1)
)
```
Speaking of branching, what happens if we encounter a branch or jump instruction while backtracking? Suddenly, we can’t form a good idea of what constitutes the statement we’re trying to build, since the shape of our instruction consumption tree would depend on which branch was taken.
## Tying it together with the control flow graph
Both the control flow graph and the instruction consumption logic work together in determining the final decompiled source code.
The control flow analysis should have removed any stray branch instructions (jmp, bt or bnt), in the sense that they have been absorbed into semantic control flow graph nodes (if statements, loops, etc…), leaving just non-branching sequences of instructions. So our instruction consumption code doesn’t really need to deal with them anymore.
However, it does need to deal with the problem of not being able to satisfy instruction consumption in the current sequence of instructions. So let’s see how that might work.
Let’s revisit what we have in our control flow graph. Yes, we have all those semantic nodes: if, loop, switch and so on. But we also have “raw code” nodes in the leaves of the graph. That is, the condition node in a loop will contain the raw set of instructions that comprise that condition.
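To make this concrete, here's one hypothetical way such a graph could be represented (the class names and shapes are mine, not SCI Companion's): semantic nodes for control flow, with "raw code" leaves holding plain instruction sequences.

```python
from dataclasses import dataclass, field

@dataclass
class RawCode:
    # A straight-line sequence of (opcode, operand) pairs, no branches.
    instructions: list

@dataclass
class Loop:
    # The condition leaf holds the raw instructions computing the loop test;
    # the body holds child nodes, which may be semantic or raw.
    condition: RawCode
    body: list = field(default_factory=list)

# e.g. the condition of (while (< temp0 7) ...) as a raw-code leaf:
loop = Loop(condition=RawCode([("lst", "temp0"), ("ldi", 7), ("lt?", None)]))
```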
So if we take our example from the previous section, after doing our first pass of instruction consumption, we’ll end up with something like this:
```
[If]
    [Condition]
        bnt [01e5]
            pToa [01e3]
    [Then]
        send [01f2]
            pushi [01e7]
            pushi [01ea]
            push [01ec]
                [NeedsAccumulator]
            push0 [01ed]
            lsp [01ee]
            lag [01f0]
```
You’ll see that we were not able to find an accumulator value to satisfy the push instruction. So we tagged it as “needs accumulator”.
On a second pass, we’ll try to work backwards to find the appropriate value. Below, you’ll see that we “cloned” the pToa value from above:
```
[If]
    [Condition]
        bnt [01e5]
            pToa [01e3]
    [Then]
        send [01f2]
            pushi [01e7]
            pushi [01ea]
            push [01ec]
                pToa [01e3]
            push0 [01ed]
            lsp [01ee]
            lag [01f0]
```
Working backwards doesn’t just mean returning to the original set of raw instructions and scanning backwards, of course; that would run us into the branching problem again, where there is no statically known path to our current instruction.
Instead, we need to follow the control flow graph hierarchy “up and backwards”. And the way we do that ends up being fairly complex, and changes depending on the type of the node in the control flow graph.
Another point that is important to bring up: some of the control flow graph nodes themselves need to be treated as though they generate a value for the accumulator. For instance, if-statements, switch-statements and isolated compound conditions are all treated as such. That is, if we can’t find an accumulator-value-producing instruction to satisfy our instruction consumption, and we encounter an if node while navigating “up and backwards”, then we’ll use that if node as an accumulator-value-producer. We often see this pattern in source code like the following (the equivalent of the ternary operator in most languages):
```
(= someVar
    (if blahblah 5 else 2)
)
```
Of course, then we need to make a call as to whether we can remove the if node from its previous spot in the instruction consumption tree, or we have to leave it in and clone a new one (resulting in possibly incorrect code since we’ll evaluate the if-statement twice). These subtleties and complexities help to illustrate some of the difficulties in implementing an accurate decompiler.
During this phase we also do some analysis to classify loops: are they while-loops, do-loops or infinite loops? We can analyze the structure of the loop to figure this out. For instance, if the latch node has two successors, and the loop head node has only one, we know it’s a do-loop (the loop condition comes at the end).
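The loop-classification heuristic just described might look like this, assuming a hypothetical graph encoding (a dict mapping each node to its list of successors); the function name and the exact rules for the while/infinite cases are my illustrative simplifications:

```python
def classify_loop(successors, head, latch):
    """head: first node of the loop; latch: node with the back edge to head."""
    head_out = len(successors[head])
    latch_out = len(successors[latch])
    if latch_out == 2 and head_out == 1:
        return "do-loop"       # condition tested at the end of the loop
    if head_out == 2:
        return "while-loop"    # condition tested at the start
    return "infinite-loop"     # no conditional exit found at head or latch

# A do-loop shape: the head falls straight into the body,
# and the latch either branches back to the head or exits.
g = {"head": ["latch"], "latch": ["head", "exit"], "exit": []}
```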
At the end of the instruction consumption phase, we have now generated a “consumption node” tree that incorporates not only information from the control flow graph, but also more detailed information on how individual code statements are constructed: flat sequences of instructions have been re-organized into hierarchical forms. We’re now ready for the next phase, which involves constructing the abstract syntax tree for the source code.
Below is an example of what the instruction consumption tree might look like. First, the decompiled method:
```
(method (constantRocks &tmp temp0)
    (rCliffs cheatCount: 0)
    (= temp0 0)
    (while (< temp0 7)
        ((= [newRockStep temp0] (RockStep new:))
            x: (if argc (- 320 [local246 temp0]) else [local246 temp0])
            y: [local262 temp0]
            cel: 3
            rockPointer: temp0
            corner: (if (== temp0 6) 1 else 0)
            init:
            stopUpd:
        )
        (if (< temp0 6)
            ([newRockStep temp0] addToPic:)
        )
        (++ temp0)
    )
)
```
And then, the instruction consumption tree that was used to generate this:
```
link
send
    pushi
    push1
    push0
    class
sat [035d]
    ldi [035b]
[While]
    [LoopBody]
        send [03b0]
            push1
            push1
            push [037a]
                [If]
                    [Condition]
                        bnt [036a]
                            lap
                    [Else]
                        lali
                            lat
                    [Then]
                        sub
                            pushi [036c]
                            lali
                                lat [036f]
                        jmp
            push0 [037b]
            push1 [037c]
            lsli [037f]
                lat [037d]
            pushi
            push1
            pushi
            pushi
            push1 [038a]
            push [038b]
                lat [037d]
            pushi [038c]
            push1 [038f]
            push [039c]
                [If]
                    [Condition]
                        bnt
                            eq?
                                push
                                    lat [037d]
                                ldi
                    [Else]
                        ldi [039a]
                    [Then]
                        ldi
                        jmp
            pushi [039d]
            push0 [039f]
            pushi [03a0]
            push0 [03a3]
            sali [03ae]
                push [03ab]
                    send [03a9]
                        pushi [03a4]
                        push0 [03a6]
                        class [03a7]
                lat [03ac]
        [If]
            [Condition]
                bnt [03b7]
                    lt? [03b6]
                        lst [03b2]
                        ldi [03b4]
            [Then]
                send [03c1]
                    pushi [03b9]
                    push0 [03bc]
                    lali [03bf]
                        lat [03bd]
        +at [03c3]
        jmp [03c5]
    [Condition]
        bnt
            lt?
                lst [035f]
                ldi
ret [03c7]
```
The next post ties everything together and summarizes the final steps.