In the first installment of this series, we got a Z80 architecture up and running with disassembly and control graphs. In this second installment, we’ll add lifting and discuss Binary Ninja’s intermediate languages.
Recall an architecture plugin’s job: to inform Binary Ninja (Binja) about an unknown architecture so Binja can open new binaries, perform analysis, etc. This is done by populating Architecture class fields like
.regs and implementing class methods like
get_instruction_low_level_il() from architecture.py:
|member variables in the architecture class||describe registers, flags, address size, etc.|
||describe instruction size, branch behavior|
||provide an instruction’s string representation|
||provide instruction semantics|
$ git clone https://github.com/Vector35/Z80 $ cd Z80 && git checkout tutorial2_start
get_instruction_low_level_il() in Z80Arch.py is awaiting a real definition:
def get_instruction_low_level_il(self, data:bytes, addr:int, il:'lowlevelil.LowLevelILFunction') -> Optional[int]: return None
What is Lifting?
Lifting is translating from a lower level language to an intermediate language considered higher. Since it starts “lower” and ascends “higher”, the process earned the name “lifting”. I wonder if other terms like “raising” were considered. Anyway, the lower level language is the machine instructions of the architecture we’re targeting (Z80) and the intermediate language is called Binja’s low level intermediate language (LLIL). Why would it be called low level intermediate language if it’s intended to be “higher” than the machine instructions? Because in Binja’s hierarchy of IL’s, collectively called BNIL, LLIL is the lowest:
Why do this? The main reason is so Binja can have a common, generalized analysis target rather than a collection of analyzers per architecture. Imagine the maintenance required to maintain a special analyzer for every architecture, and having to construct a new one for each plugin!
So BNIL acts a common tongue, and it’s our job to write a translator from the target architecture to this common language Binja expects:
LLIL: The Callback
Alright, so Binja’s going to be asking us for LLIL by invoking our plugin’s
get_instruction_low_level_il(). Let’s first look at what we’re supplied to do the job with the type hints:
def get_instruction_low_level_il(self, data:bytes, addr:int, il:'lowlevelil.LowLevelILFunction') -> Optional[int]:
data and corresponding
length hold bytes from the instruction to be translated. The
il is an initially empty container that collects LLIL for this function that we produce on a per-instruction basis with invocations of this callback.
il parameter actually serves two roles. In addition to being the container for produced LLIL, it’s also a reference to a class with many LLIL producing helper functions. See: lowlevelil.py. As the container, we will be calling
il.append(...) to append LLIL instructions to it. As the helper, we will be calling
il.store() to produce
il.set_reg(...) to produce
Exercise #1: unimplemented
Well, we can’t lift Z80 bytes without first disassembling, so let’s start with a call to the disassembler. Note that since the first tutorial, Z80 has migrated away from skoolkit disassembler to z80dis.
def get_instruction_low_level_il(self, data:bytes, addr:int, il:'lowlevelil.LowLevelILFunction') -> Optional[int]: decoded = decode(data, addr) if decoded.status != DECODE_STATUS.OK or decoded.len == 0: return None
Do you remember the roles of the
il parameter? Let’s try them both. When an instruction disassembly succeeds, we’ll use the helper
il.unimplemented() to construct a simple single-instruction expression that marks that the instruction isn’t yet implemented by the lifter. Then, as the container, we’ll
il.append() it to function’s LLIL so far:
expr = il.unimplemented() il.append(expr) return decoded.len
See checkpoint #1. Now reload the binary and observe the disassembly:
Check out those red no symbols. Binja has received our
LLIL_UNIMPLEMENTED and by default tags each line with that symbol, a built-in todo list! Check the LLIL view and confirm that everything is lifted to unimplemented.
LLIL: Assembly Nature
Obviously to be a translator from Z80 to LLIL, we’re going to need to know something about LLIL. LLIL is intended to be as close to an assembly language as possible. There is a set of about 135 instructions available and execution flows sequentially except for branching instructions like call and jump. There are no structured control flow constructs except for an “if” instruction.
You don’t need to understand stack allocation, or how calling conventions are applied, or even largely how flags work. You simply need to describe the semantics of the instruction and BN figures figures the rest out for you, or allows you to specify the semantics specifically when they diverge from normal.
The instructions are the LLIL_XXX values from the
LowLevelIlOperation enumeration in generated
class LowLevelILOperation(enum.IntEnum): LLIL_NOP = 0 LLIL_SET_REG = 1 # ... LLIL_LOAD = 6 LLIL_STORE = 7 LLIL_PUSH = 8 LLIL_POP = 9 # ...
Hopefully the names of these LLIL instructions already have you imagining how they might map to some machine instructions. It was designed with ease of lifting in mind.
BNIL is a domain specific language (DSL), but you never write it down as text. Like we never produce the string
LLIL_STORE and ask Binja to compile or otherwise process it. Instead, it’s an internal or embedded DSL, being expressed in and hosted by the language the lifter was written, Python in this case.
Here’s an abridged table showing the correspondence between raw LLIL instructions and the hosting constructs in python:
Exercise #2: Lifting NOP
Let’s lift our first instruction to LLIL that has no operands:
LLIL_NOP. We simply detect when the Z80 disassembler found a Z80 nop and return an LLIL nop expression:
expr = None if decoded.op == OP.NOP: expr = il.nop() else: expr = il.unimplemented() il.append(expr)
The python encapsulation of our produced LLIL looks like:
Reload the binary, and since there’s no naturally occurring NOP in this land, patch in four zero bytes and behold:
See checkpoint #2.
LLIL: Tree Nature
Unlike assembly’s rigid instructions, LLIL is flexible because many of its instructions accept subexpressions in their operands. And those operands might contain subexpressions, and so on.
This nearly cries for a tree-like mental picture. Every LowLevelILInstruction can be thought of as internal nodes of a tree. Each of their operands that isn’t another LowLevelILInstruction is a leaf node. So we can revisit our lift of NOP and draw:
Trail of Bits’ Breaking Down Binary Ninja’s Low Level IL is approaching four years old, but I doubt its exposition of this concept can be topped and I recommend it as prerequisite reading. You’ll see some elements and examples in this post inspired from this seminal article. It also includes some code for printing existing lifted IL in a text outline format which reinforces the tree idea when while studying the product of preexisting lifters.
Another tool for studying existing LLIL is the bnil-graph plugin by withzombies. You can navigate to an instruction of interest and simply menu click to have it draw the tree structure within Binja. Further, it produces code that recognizes when an input IL matches the selected IL.
If LLIL is to be thought of as a tree-like language, then learning to translate to LLIL could benefit from cultivating an ability to think of assembly instructions in a tree-like manner. For me, an intermediate step of imagining a composition of functions is helpful. For example, the x86 instruction
mov eax, 2 might first become the composition
mov(reg("eax"), 2)) and after surveying lowlevelil.py I find my
mov() maps to
LLIL_SET_REG whose arguments tell me the 2 can’t be supplied naked, but requires a wrapping in
LLIL_CONST, resulting in:
LLIL_SET_REG ├─── eax └─── LLIL_CONST └─── 2
The x86 instruction
lea eax, [edx+ecx*4] might decompose to
lea(add(reg("edx"),mul(reg("ecx"), 4))) before mapping to
LLIL_MUL which can be arranged into the tree:
LLIL_SET_REG ├─── eax └─── LLIL_ADD ├─── LLIL_REG │ └─── edx └─── LLIL_MUL ├─── LLIL_REG │ └─── ecx └─── LLIL_CONST └─── 2
Exercise #3: Lifting PUSH
Let’s try lifting PUSH. We’ll map it to Binja’s
LLIL_PUSH. There is not a separate LLIL instruction for pushing registers or constants or flags.
LLIL_PUSH works in general for whatever is supplied as an operand. Here, we implement the case where the object of the push is a register, built in the variable
elif decoded.op == OP.PUSH: if decoded.operands: (oper_type, oper_val) = decoded.operands if oper_type == OPER_TYPE.REG: subexpr = il.reg(REG_TO_SIZE[oper_val], reg2str(oper_val)) expr = il.push(REG_TO_SIZE[oper_val], subexpr)
Note that both the helper function
il.push() need the size of the register access and push, respectively, which necessitates the
REG_TO_SIZE lookup. And even if you don’t know anything about program analysis, doesn’t it make sense that in order to analyze a push, you’d need to know how many bytes are going on the stack? And to analyze a register access, you’d need to know how large the register is? Returning to our imagined tree,
LLIL_PUSH sits at the root, and its one operand is a branch to a subexpression node
Let’s confirm this works when pushing registers in the disassembly view:
And LLIL view:
See checkpoint #3.
LLIL: Thinking in LLIL
As you lift more instructions, your mental library of LLIL instructions will increase, and your need to consult lowlevelil.py will decrease.
Exercise #4: LD immediate to register
Consider the instruction
LD DE, 0x7000 which loads the value 0x7000 into register DE. A function view might be
ld(reg("de"), 0x7000). Since it writes to a register, we find a fit with
LLIL_SET_REG and look to the prototype of the corresponding helper function
set_reg() from lowlevelil.py:
def set_reg(self, size:int, reg:'architecture.RegisterType', value:ExpressionIndex, flags:Optional['architecture.FlagType']=None) -> ExpressionIndex:
It only takes one expression argument. This helper function will automatically form the register operand for us using the register we supply. We’ll also make an educated guess concerning whether the immediate is an address or not by testing whether it’s a nonzero 2-byte value. Non-address values map to
LLIL_CONST while address values map to
elif decoded.op == OP.LD: (oper_type, oper_val) = decoded.operands (operb_type, operb_val) = decoded.operands if oper_type == OPER_TYPE.REG: size = REG_TO_SIZE[oper_val] subexpr = None if oper_type == OPER_TYPE.REG and operb_type == OPER_TYPE.IMM: if size == 2 and operb_val != 0: subexpr = il.const_pointer(2, operb_val) else: subexpr = il.const(size, operb_val) expr = il.set_reg(size, reg2str(oper_val), subexpr)
The resulting tree is:
See checkpoint #4.
Exercise #5: LD register to memory
Consider the instruction
LD (0x7002),A which writes the value in register A to address 0x7002. Functionally, we could write perhaps
LD(0x7002,reg("A")). Writing a value to memory matches
LLIL_STORE and its corresponding helper function prototype is:
def store(self, size:int, addr:ExpressionIndex, value:ExpressionIndex, flags=None) -> ExpressionIndex:
Notice we have to supply two subexpressions here. None will be built automatically from the helper like was the case with helper
elif decoded.op == OP.LD: #... elif oper_type == OPER_TYPE.ADDR_DEREF: if operb_type == OPER_TYPE.REG: size = REG_TO_SIZE[operb_val] subexpr_a = il.const_pointer(size, oper_val) subexpr_b = il.reg(size, reg2str(operb_val)) expr = il.store(size, subexpr_a, subexpr_b)
The resulting tree is predictable given the two subexpressions:
See checkpoint #5.
Factoring out operand lifting
It’s common when lifting an instruction that we must generate LLIL for operands. Rather than replicate this for each instruction, we should have some common utility. Indeed this is done in almost all Vector35’s architectures:
|Architecture||Operand Lifting Utilities|
|6502 / NES||
The upfront cost of implementing such a function will pay dividends as it’s used when lifting future instructions. It’s also not a bad time to fully familiarize ourselves with the target architecture and the operand types reported by the disassembler. Let’s survey the types of operands of Z80:
Register operands map easily to
if oper_type == OPER_TYPE.REG: return il.reg(REG_TO_SIZE[oper_val], reg2str(oper_val))
Dereferenced register operands map to
elif oper_type == OPER_TYPE.REG_DEREF: tmp = il.reg(REG_TO_SIZE[oper_val], reg2str(oper_val)) return il.load(size_hint, tmp)
Address operands map to
elif oper_type == OPER_TYPE.ADDR: return il.const_pointer(2, oper_val)
Dereferenced address operands map to
elif oper_type == OPER_TYPE.ADDR_DEREF: tmp = il.const_pointer(2, oper_val) return il.load(size_hint, tmp)
Dereferenced indexed memory operands map to
LLIL_LOAD of an address formed by the addition
LLIL_ADD of a register
LLIL_REG and an offset
elif oper_type in [OPER_TYPE.MEM_DISPL_IX, OPER_TYPE.MEM_DISPL_IY]: reg_name = 'IX' if oper_type == OPER_TYPE.MEM_DISPL_IX else 'IY' tmp = il.add(2, il.reg(2, reg_name), il.const(1, oper_val)) return il.load(size_hint, tmp)
Immediates map easily to
elif oper_type == OPER_TYPE.IMM: return il.const(size_hint, oper_val)
Exercise #6: generalized operand lifter
Try to implement the utility function
operand_to_il(). It must be given the operand type (as returned from the disassembler), operand value, the LowLevelILFunction, and the size of the parameters:
def operand_to_il(self, oper_type, oper_val, il, size_hint=0):
It’s also happens to be convenient when lifting Z80 to have the ability to ignore the load of the dereferenced operands. For instance, when
(0x7002) is the left hand side (destination) of an
LD instruction, we don’t want to actually lift a load from there, just lift as an address. But when it’s the right hand side (source) of an
LD, we do want to lift as an
LLIL_LOAD. So we introduce a helpful
ignore_load flag to our utility function’s prototype.
def operand_to_il(self, oper_type, oper_val, il, size_hint=0, ignore_load=False):
See the new function
operand_to_il() and our current lifting implementation adapted to use it in checkpoint #6.
LLIL: Exact fits and getting inventive
Every example lift so far has had nearly a one-to-one correspondence between assembly instructions and LLIL. A natural question to ask is “Can any assembly instruction be described by LLIL?”. Definitely not directly, as the number of instructions in modern ISAs vastly exceed the number of instructions in LLIL.
This section is a lesson that LLIL won’t always have an exact fit for the assembly instruction. Sometimes you need to cobble something together.
Exercise #7: lifting push AF
We currently lift
PUSH AF as an
LLIL_REG AF. And while AF is a register pair in that it can be the object of register reads and pushes and what not, access to it is special in Z80 because the low 8 bits (the “F” part) map to the flags.
Push can only target AF, BC, DE, HL, IX, IY, so let’s test when AF is the object:
# when pushing AF, actually push the flags if oper_val == REG.AF:
We need to form the 2-byte value that goes on the stack. The high 8 bits will be register
A but the low 8 bits must contain the flags:
But there’s no
LLIL_HANDLE_THESE_Z80_FLAGS. So what do we do? Imagine how you might do it in C: shifts and ors. We’ll synthesize it with
expr = il.push(2, il.or_expr(2, il.or_expr(1, il.or_expr(1, il.shift_left(1, il.flag('s'), il.const(1, 7)), il.shift_left(1, il.flag('z'), il.const(1, 6)) ), il.or_expr(1, il.or_expr(1, il.shift_left(1, il.flag('h'), il.const(1, 4)), il.shift_left(1, il.flag('pv'), il.const(1, 2)) ), il.or_expr(1, il.shift_left(1, il.flag('n'), il.const(1, 1)), il.flag('c') ) ) ), il.shift_left(2, il.reg(1, 'A'), il.const(1, 8) )))
I think it’s cool how the tree structure can been seen in the the code that constructs the expression. Test it out by flipping to the LLIL view to confirm:
See checkpoint #7.
LLIL: Lifting branches
Unconditional branches that have a computed target present no new challenges. Call to an address operand maps to
elif decoded.op == OP.CALL: if oper_type == OPER_TYPE.ADDR: expr = il.call(il.const_pointer(2, oper_val))
RET in Z80) takes as input how much the stack is adjusted:
elif decoded.op == OP.RET: if oper_type == None: expr = il.ret(il.pop(2))
Jumps to an absolute address (
JP in Z80) or relative address (
JR in Z80) map to
LLIL_GOTO. The former takes as input an expression that evaluates to an address, and the latter takes a label, which is a name that can be attached to IL locations. Note the disassembler z80dis uniformly returns an effective address, accounting for the displacement if the jump is relative. We can test whether a label exists for this effective address using the
il.get_label_for_address() function. When it succeeds, we’ll use
elif decoded.op in [OP.JP, OP.JR]: if oper_type == OPER_TYPE.ADDR: tmp = il.get_label_for_address(Architecture['Z80'], oper_val) if tmp: expr = il.goto(tmp) else: expr = il.jump(il.const_pointer(2, oper_val))
Exercise #8: Lift CALL, RET, JP, JR
See if you can get these implementations emplaced and working. Scroll through the binary and make sure calls, returns, and jumps make sense. Well, the unconditional ones, anyways.
See checkpoint #8.
LLIL: Lifting conditional instructions
The previous lesson about branches ignored the possibility of an operand that would make
JP conditional. So how do we do those? The answer is LLIL provides an assembly-language like test and goto instruction called
LLIL_IF which can point at labels we insert in code. The general form is something like:
LLIL_IF <condition, label_true, label_false> label_true: (lifted instruction) label_false:
The steps to construct this are:
- create an expression for the condition test
- allocate label(s) using
LowLevelILLabel()call from lowlevelil.py
- create an
il.if(...)supplying the condition test and two labels and append it to
- mark the current spot with the “true” label using
- append code that should be executed for this label
- mark the current spot with the “false” label using
What makes this convenient is that the labels you allocate aren’t committed to any address. You can provide them immediately to
il.if() and only later commit with
Exercise #9: Lifting conditional RET
The only operand
RET can take is one of the conditional flags, so the presence of an operand means it’s conditional.
elif decoded.op == OP.RET: if oper_type == None: expr = il.ret(il.pop(2)) else: antecedent = self.jcc_to_flag_cond(oper_val, il) # step 1 t = LowLevelILLabel() # step 2 f = LowLevelILLabel() il.append(il.if_expr(antecedent, t, f)) # step 3 il.mark_label(t) # step 4 il.append(instr) # step 5 il.mark_label(f) # step 6
How was that antecedent formed though? Just as we had a helper function
operand_to_il() for normal operands, we need a helper function
jcc_to_flag_cond() for condition codes. HINT: use
il.not_expr() to invert them.
See checkpoint #9 for the solution. In
sub_882f() we have a conditional
RET on the zero flag:
And the lift has it executing the
z is true, and jumping to the
DEC (IX) otherwise:
How can you know if your lifting is accurate? I wish I had more formal advice to offer, but the effective techniques so far are:
- Manual inspection
- Make sure the resulting analysis looks sensical. After you lift an instruction, inspect the LLIL, then the MLIL, then the HLIL.
- Test driven development
- Start with what you believe should be the lifted output of several assembly instructions, then develop until the test is met. This is the approach taken for over 2000 cases in the arch-arm64 lifting test.
- Run a test program vs. a compiled version of a lifted test program. If the outputs match, your lift is likely very accurate. The llil_transpiler project has a target for Z80 using the small device C compiler (SDCC) and the implementation of emulated of LLIL instructions stands independently as a nice reference of what LLIL instructions are intended to do.
None of these truly “verify” an accurate lift, they just increase the confidence.
Q & A
- What is the difference between LLIL and Lifted LLIL?
- Lifted LLIL is the raw product of an architecture plugin’s call to
get_instruction_low_level_il(). Everything we produced in this post is is Lifted LLIL. It’s the crude oil straight from the ground. After some refinement you get LLIL. A series of
LLIL_NOPis collapsed into a single NOP, for example.
- What is the difference between LLIL and BNIL?
- BNIL is an umbrella term for all the levels of IL Binja uses, and Lifted LLIL is at the bottom level.
- What is the difference between an expression and an actual LLIL instruction?
- An expression (or expression index) is an internal numeric identifier for a portion of an LLIL instruction’s full tree and is returned by most of the helper functions. Recall the tree for the x86 instruction
lea eax, [edx+ecx*4]:
LLIL_SET_REG ├─── eax └─── LLIL_ADD ├─── LLIL_REG │ └─── edx └─── LLIL_MUL ├─── LLIL_REG │ └─── ecx └─── LLIL_CONST └─── 2
This might be stored internally as:
|ExpressionIndex||Instruction||Operand 0||Operand 1|
Thus there are six total expressions for this instruction and the instruction is technically itself an expression, with index 0.
- What is the difference between an expression and an instruction index?
An expression (or expression index) is a per-instruction identifier for a part of an instruction’s tree. See above.
An instruction index is a per-function identifier for instructions.
To reiterate, functions consist of instructions, indexed by instruction indices. Each instruction consists of expressions, indexed by expression indices.
I hope this post has been helpful in getting you started lifting! If you’re looking for more exercises, I recommend you extend the conditional
RET to conditional
JR. You can always restart the tutorial by checking out branch tutorial2_start from the Z80 git repo or simply re-clicking the checkpoints. Spoilers can be had by checking out branch master.
We welcome feedback. Is there something you wish were covered in more detail? Less? Do you have a question that belongs in the Q&A?
Hop on the
#api-help channel in our community slack.