# Disassembly rules

There are several cases and features of the disassembly process that must be discussed before they are implemented. This document may also serve as documentation after the features have been implemented.

## Unaligned data and instructions

The sizes of all Motorola 68000 ISA instructions are multiples of 2. The original M68000 ISA (not 68010 or other) also does not support unaligned instruction access, i.e. jumps to an address with the lowest bit set are invalid and raise an address error exception on the real hardware. Nevertheless, m68k-disasm is going to support such instructions, because:

- At least GNU AS and Sierra ASM68 support it;
- Data may be unaligned without a problem, hence allowing instructions to be unaligned yields a consistent implementation of the disassembly algorithm across everything the disassembler emits.

The only way unaligned instruction execution may happen that I can see now is jumping into an unaligned location.

## Jumping into the middle of an instruction

It is unlikely to happen in a real binary, but assemblers are perfectly capable of producing such code, and what is more, it may work without a problem, since a part of a long instruction may be a valid short instruction.

A reference to a location in the middle of an instruction may be made with simple arithmetic like this (GNU AS syntax):

```asm
label:
    andiw #0x4e71,%d1
    bras label+2
```

Which may be disassembled just fine:

```asm
L00000000:
    andiw #20081,%d1 | 0241 4e71 @00000000
    bras L00000000+2 | 60fc @00000004
```

But this disassembly does not consider the `nop` instruction hidden inside the `andiw` instruction.
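The hidden instruction can be checked directly from the bytes. This is a minimal Python sketch, assuming only the six bytes of the listing above; the helper names `word_at` and `bras_target` are hypothetical and are not part of m68k-disasm:

```python
import struct

# Bytes of the listing above: andiw #0x4e71,%d1 ; bras label+2
code = bytes.fromhex("02414e7160fc")

NOP = 0x4E71  # opcode word of `nop`


def word_at(data, offset):
    """Return the big-endian 16-bit word stored at `offset`."""
    return struct.unpack_from(">H", data, offset)[0]


def bras_target(data, offset):
    """Target of a short `bra` at `offset`: (offset + 2) plus the signed
    8-bit displacement held in the low byte of the opcode word."""
    word = word_at(data, offset)
    # 0x60 in the high byte is `bra`; a zero low byte would mean that a
    # 16-bit displacement follows, which this sketch does not handle.
    assert word >> 8 == 0x60 and word & 0xFF != 0
    disp = struct.unpack("b", bytes([word & 0xFF]))[0]
    return offset + 2 + disp


target = bras_target(code, 4)
print(target, hex(word_at(code, target)))
```

The branch at offset `4` lands on offset `2`, and the word stored there is `4e71`, i.e. the immediate operand of `andiw` doubles as a `nop`.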
Here is another disassembly variant of the same code, just to show the `nop` instruction, obtained using a trace table with PC trace entries at addresses `0`, `2` and `4`:

```asm
    .short 0x0241 | 0241 @00000000
L00000002:
    nop | 4e71 @00000002
    bras L00000002 | 60fc @00000004
```

## Unaligned jump into the middle of an instruction

But what about an unaligned jump?

```asm
label:
    andiw #0x4e71,%d1
    bras label+3
```

The listing is valid and the assembler will emit exactly what it represents, but now it does not make any sense, since `7160` is not a valid instruction. If it were `7060`, then it would be valid, because `7060` is `moveq #0x60,%d0`. However, such a jump is not guaranteed to work on the real hardware, so alignment could serve as an attribute for some heuristics. On the other hand, the approach of m68k-disasm is to not try to be a smarty-pants about everything it handles, at least by default. The priority is producing a listing that an assembler translates into a binary file that is 100% identical to the initial binary being disassembled, no matter what.

... TODO ...

## Two modes of operation (QuickPeek mode)

There are basically two modes of operation in the m68k-disasm disassembler: **QuickPeek** mode and **Proper** mode.

In QuickPeek mode there are not many rules. It starts at offset `0` of the file, takes 2 bytes and tries to interpret them as an instruction. If an instruction takes more than two bytes, all of its other bytes are taken into account. A Node is created that occupies as many slots as it needs in AddressSpace. A Node, as well as AddressSpace, are internal structures. AddressSpace is just an array of pointers to Nodes and it can fit up to 4Mi pointers.

Let's say there is a `nop` (`4e71`) instruction at offset `0`. A single Node is created and placed at slots `0` and `1` in AddressSpace. Next, let's say an `andiw #0x03ff,%d1` (`024103ff`) instruction is encountered, which takes 4 bytes, so its Node is created and occupies 4 slots in AddressSpace at offsets `2`, `3`, `4` and `5`.
And so on, and so on.

If an instruction references a memory location as code, i.e. it jumps or branches to a location not yet handled by the disassembler, then a Node is created at that location, but it preserves 2-byte alignment. For example, encountering an instruction that branches to offset `17` creates a Node of size 2 at offset `16` if one does not exist yet, and it is not disassembled yet. It is marked as *not handled* in some way. When the disassembler finally reaches address `16` by walking over instructions one by one, it tries to disassemble it and goes further... unless it turns out that an instruction at offset `14` takes 4 bytes when disassembled and virtually *includes* the instruction located at offset `16`. In that case the Node at location `16` is moved to location `14` and disassembled there instead. As a result, the instruction covers both locations, `14` and `16`, taking 4 slots in AddressSpace in total. One byte of the binary data being disassembled corresponds to one slot in AddressSpace, as you have probably noticed already.

What if an instruction being disassembled is invalid? Well, it is left intact then and emitted as a `.short 0xXXXX` line, with a label above it if it has been referenced from somewhere.

Hence the rules of QuickPeek mode are:

- Disassembly always starts at offset `0`.
- Disassembly only takes place at addresses aligned to 2, i.e. even offsets.
- References to unaligned locations (odd offsets) are expressed relative to aligned locations only, with some arithmetic.
- No Nodes are placed at unaligned addresses (odd offsets).
- Referencing forward (not yet handled) locations creates unhandled Nodes.
- Unhandled Nodes may be extended backwards and forwards when reached by the disassembly procedure, retaining all references to them. But unhandled Nodes are never shrunk or removed.
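The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the m68k-disasm implementation: `Node`, `decode` and `quick_peek` are hypothetical names, the toy decoder knows only `nop`, `andiw #imm,%dN` and short `bras`, and AddressSpace is sized to the input instead of 4Mi slots.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    offset: int            # first slot covered, always even in QuickPeek mode
    size: int = 2          # number of AddressSpace slots covered
    handled: bool = False  # False until the walker has disassembled it
    text: str = ""
    refs: set = field(default_factory=set)  # referenced offsets, may be odd


def decode(data, off):
    """Toy decoder (assumption): (size, text, branch_target) or None."""
    word = int.from_bytes(data[off:off + 2], "big")
    if word == 0x4E71:
        return 2, "nop", None
    if word & 0xFFF8 == 0x0240 and off + 4 <= len(data):
        imm = int.from_bytes(data[off + 2:off + 4], "big")
        return 4, f"andiw #0x{imm:04x},%d{word & 7}", None
    if word >> 8 == 0x60 and word & 0xFF:
        disp = (word & 0xFF) - 256 if word & 0x80 else word & 0xFF
        return 2, "bras", off + 2 + disp
    return None


def quick_peek(data):
    space = [None] * len(data)  # one slot per byte, each refers to a Node

    def place(node):
        for i in range(node.offset, min(node.offset + node.size, len(data))):
            space[i] = node

    off = 0  # disassembly always starts at offset 0
    while off + 2 <= len(data):
        decoded = decode(data, off)
        if decoded is None:  # invalid instruction is emitted as raw data
            word = int.from_bytes(data[off:off + 2], "big")
            decoded = (2, f".short 0x{word:04x}", None)
        size, text, target = decoded
        node = space[off] or Node(off)
        # Absorb unhandled Nodes covered by this instruction, keeping refs.
        for i in range(off, off + size):
            other = space[i]
            if other is not None and other is not node:
                node.refs |= other.refs
        node.size, node.text, node.handled = size, text, True
        place(node)
        if target is not None and 0 <= target < len(data):
            aligned = target & ~1  # Nodes live on even offsets only
            dest = space[aligned]
            if dest is None:       # forward reference: unhandled Node
                dest = Node(aligned)
                place(dest)
            dest.refs.add(target)  # odd targets become "label+1" later
        off += size
    return space


# bras to offset 7, nop, andiw #0x4e71,%d1 (covers offsets 4..7)
space = quick_peek(bytes.fromhex("60054e7102414e71"))
node = space[7]
print(node.offset, node.size, node.text, sorted(node.refs))
# prints: 4 4 andiw #0x4e71,%d1 [7]
```

In the usage example the branch to offset `7` first creates an unhandled Node at offset `6`; when the walker decodes the 4-byte instruction at offset `4`, that Node is absorbed and the reference to `7` survives as an offset relative to the Node at `4`.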
## Proper mode

Proper mode is only activated when a trace table has been provided explicitly, or if the binary format supports and contains an entry point or more information about code and data locations, like functions or just labels.

In contrast to QuickPeek mode, nothing is assumed if not specified explicitly. Everything is considered possible, even unaligned instructions. Hence Nodes may be placed even on odd offsets like `17` and occupy an odd number of slots in AddressSpace, but not less than 1 slot.

But instead of disassembling everything, Proper mode takes guidance from the trace table data. It starts at the lowest address specified as code (PC trace or function) in the trace table and disassembles from this point.

If a Node is a PC trace Node, it is disassembled and everything after it is skipped until another PC trace Node or function Node is encountered, unless the `-fwalk` feature flag is given.

When the `-fwalk` feature flag is specified and a Node is a PC trace Node, it behaves similarly to QuickPeek mode and tries to disassemble instructions one by one until it reaches one of the following:

- a traced data Node;
- an unconditional control flow instruction like `rts`, `jmp`, `bsr` and so on;
- an invalid instruction.

If a Node is a function Node then... well, it does not make much sense to have function Nodes at all, because a function spans multiple instructions, but a) there can be only one instruction per Node and b) there can be only one Node per slot. Functions should probably be expressed through some other relationship, maybe even a half-implicit one, like ELF symbols are now.

# Internal stuff

## AddressSpace, Nodes, Regions and Symbols

AddressSpace contains an array of `4 * 1024 * 1024` slots. Each slot is just a pointer to a Node. Each Node may point to a single Region if the Node belongs to one or more Symbols. A Region is a dynamically allocated array of pointers to the Symbols that encompass this Region.
A Symbol is one of the following:

- A function;
- A data span like a string, a blob, a table of pointers or an even more complex table.

Imagine it like this. Symbols may intersect: for example, a function may contain a string at its end, and the string is nevertheless a part of the function, so the two intersect. A way to represent this intersection is an intermediate entity called a Region. Hence, multiple Nodes may share a Region and multiple Symbols may share a Region. Basically, a Region that refers to more than one Symbol represents a span where these Symbols intersect.

A little visualization of how Regions correspond to Symbols. Each dot represents a byte, and adjacent dots form a span:

```
Symbol(fn)   | .........................
Symbol(strz) |         .........
Symbol(str)  |           ..
Symbol(strz) |             .....
Symbol(blob) |                          ......
Symbol(strz) |                                .....
Region       | ........
Region       |         ..
Region       |           ..
Region       |             .....
Region       |                  ........
Region       |                          ......
Region       |                                .....
Node(pc)     | ..
Node(pc)     |   ....
Node(pc)     |       ..
Node(data)   |         ..
Node(data)   |           ..
Node(data)   |             .....
Node(pc)     |                  ..
Node(pc)     |                    ....
Node(pc)     |                        ..
Node(data)   |                          ......
Node(data)   |                                .....
```

This is how Nodes refer to Regions by holding pointers to them:

```
Symbol(fn)   | .........................
Symbol(strz) |         .........
Symbol(str)  |           ..
Symbol(strz) |             .....
Symbol(blob) |                          ......
Symbol(strz) |                                .....
Region 1     | ........
Region 2     |         ..
Region 3     |           ..
Region 4     |             .....
Region 5     |                  ........
Region 6     |                          ......
Region 7     |                                .....
Node(pc)     | ..                                    -> Region 1
Node(pc)     |   ....                                -> Region 1
Node(pc)     |       ..                              -> Region 1
Node(data)   |         ..                            -> Region 2
Node(data)   |           ..                          -> Region 3
Node(data)   |             .....                     -> Region 4
Node(pc)     |                  ..                   -> Region 5
Node(pc)     |                    ....               -> Region 5
Node(pc)     |                        ..             -> Region 5
Node(data)   |                          ......       -> Region 6
Node(data)   |                                .....  -> Region 7
```

This is how Regions refer to Symbols by holding pointers to them:

```
Symbol 1 (fn)   | .........................
Symbol 2 (strz) |         .........
Symbol 3 (str)  |           ..
Symbol 4 (strz) |             .....
Symbol 5 (blob) |                          ......
Symbol 6 (strz) |                                .....
Region          | ........                              -> Symbol 1
Region          |         ..                            -> Symbol 2
Region          |           ..                          -> Symbol 3
Region          |             .....                     -> Symbol 4
Region          |                  ........             -> Symbol 1
Region          |                          ......       -> Symbol 5
Region          |                                .....  -> Symbol 6
Node(pc)        | ..
Node(pc)        |   ....
Node(pc)        |       ..
Node(data)      |         ..
Node(data)      |           ..
Node(data)      |             .....
Node(pc)        |                  ..
Node(pc)        |                    ....
Node(pc)        |                        ..
Node(data)      |                          ......
Node(data)      |                                .....
```
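One way Regions could be derived from Symbols is to cut the address range at every Symbol boundary and keep, for each resulting span, the Symbols that encompass it. The following Python sketch shows that idea only; `Symbol`, `Region` and `build_regions` are hypothetical names, and the Symbol offsets are an assumption read off the span lengths in the visualization above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    kind: str   # "fn", "strz", "str", "blob", ...
    start: int  # offset of the first byte
    size: int   # length in bytes

    @property
    def end(self):
        return self.start + self.size


@dataclass
class Region:
    start: int
    size: int
    symbols: list  # the Symbols that encompass this Region


def build_regions(symbols):
    """Split the covered bytes into spans that share the same Symbols."""
    cuts = sorted({s.start for s in symbols} | {s.end for s in symbols})
    regions = []
    for lo, hi in zip(cuts, cuts[1:]):
        covering = [s for s in symbols if s.start <= lo and hi <= s.end]
        if covering:  # gaps between Symbols get no Region
            regions.append(Region(lo, hi - lo, covering))
    return regions


# Offsets assumed from the visualization (25, 9, 2, 5, 6 and 5 dots).
symbols = [
    Symbol("fn", 0, 25),
    Symbol("strz", 8, 9),
    Symbol("str", 10, 2),
    Symbol("strz", 12, 5),
    Symbol("blob", 25, 6),
    Symbol("strz", 31, 5),
]
regions = build_regions(symbols)
print([r.size for r in regions])
print([s.kind for s in regions[2].symbols])
```

With these offsets the sketch yields seven Regions whose sizes match the seven Region rows of the visualization, and the third Region is shared by the function and two string Symbols, which is the intersection case described above.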