From 8340b1f42288e0143bca8a254600fb34025ec803 Mon Sep 17 00:00:00 2001
From: Oxore
Date: Sun, 19 Jan 2025 00:36:58 +0300
Subject: WIP doc

---
 doc.md | 273 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)
 create mode 100644 doc.md

diff --git a/doc.md b/doc.md
new file mode 100644
index 0000000..3dbdb11
--- /dev/null
+++ b/doc.md
@@ -0,0 +1,273 @@
+# Disassembly rules
+
+There are several cases and features of the disassembly process that must be
+discussed before they are implemented. This document may also serve as
+documentation after the features have been implemented.
+
+## Unaligned data and instructions
+
+The sizes of all Motorola 68000 ISA instructions are multiples of 2 bytes. The
+original M68000 ISA (not 68010 or other) also does not support unaligned
+instruction access, i.e. jumps to an address with the lowest bit set are invalid.
+I don't know exactly what happens on the real hardware in that case.
+Nevertheless, m68k-disasm is about to get support for such instructions,
+because:
+
+- At least GNU AS and Sierra ASM68 support it;
+
+- Data may be unaligned without a problem, hence allowing instructions to be
+  unaligned will yield a consistent implementation of the disassembly algorithm
+  across all the things the disassembler emits.
+
+The only way unaligned instruction execution may happen that I can see now is
+jumping into an unaligned location.
+
+## Jumping into the middle of an instruction
+
+It is unlikely to happen in a real binary produced by an assembler, but
+assemblers are perfectly capable of producing such code. Moreover, it
+may work without a problem, since a part of a long instruction may be a valid
+short instruction.
+A reference to a location for jumping into the middle of an
+instruction can be written with simple arithmetic like this (GNU AS syntax):
+
+```asm
+label:
+    andiw #0x4e71,%d1
+    bras label+2
+```
+
+Which may be disassembled just fine:
+
+```asm
+L00000000:
+    andiw #20081,%d1      | 0241 4e71 @00000000
+    bras L00000000+2      | 60fc @00000004
+```
+
+But this disassembly does not reveal the `nop` instruction hidden inside the
+`andiw` instruction. Here is another disassembly of the same code, just
+to show the `nop` instruction, obtained using a trace table with PC trace
+entries at addresses `0`, `2` and `4`:
+
+```asm
+    .short 0x0241         | 0241 @00000000
+L00000002:
+    nop                   | 4e71 @00000002
+    bras L00000002        | 60fc @00000004
+```
+
+## Unaligned jump into the middle of an instruction
+
+But what about an unaligned jump?
+
+```asm
+label:
+    andiw #0x4e71,%d1
+    bras label+3
+```
+
+The listing is valid and the assembler will emit exactly what it represents, but
+now it does not make any sense, since `7160` is not a valid instruction. If
+it were `7060`, it would be valid, because `7060` is `moveq #0x60,%d0`.
+It is not guaranteed to work on the real hardware though, so it could be an
+attribute for some heuristics.
+
+On the other hand, the approach of m68k-disasm is not to try to be a smarty-pants
+about everything it handles, at least by default. The priority is producing a
+listing that will be translated by an assembler into a binary file that is 100%
+identical to the initial binary being disassembled, no matter what.
+
+... TODO ...
+
+
+## Two modes of operation (QuickPeek mode)
+
+There are basically two modes of operation in the m68k-disasm disassembler:
+**QuickPeek** mode and **Proper** mode.
+
+In QuickPeek mode there are not many rules. The disassembler starts at offset `0` of the file,
+takes 2 bytes and tries to interpret them as an instruction. If an instruction
+takes more than two bytes, the remaining bytes are consumed as well.
+A Node is
+created that occupies as many slots as it needs in AddressSpace. Both Node and
+AddressSpace are internal structures. AddressSpace is just an array of
+pointers to Nodes and it can fit up to 4Mi pointers in it. Let's say there is a
+`nop` (`4e71`) instruction at offset `0`. A single Node is created and placed at
+slots `0` and `1` in AddressSpace. Next, let's say an `andiw #0x03ff,%d1`
+(`024103ff`) instruction is encountered, which takes 4 bytes, so its Node is
+created that occupies 4 slots in AddressSpace at offsets `2`, `3`, `4` and `5`.
+And so on, and so on.
+
+If an instruction references a memory location as code, i.e. it jumps/branches
+to a location not yet handled by the disassembler, then a Node is created
+at that location, but 2-byte alignment is preserved. For example, encountering
+an instruction that branches to offset `17` will create a Node at
+offset `16` of size 2, if it does not exist yet, and it won't be disassembled yet.
+It will be marked as *not handled* in some way. When the disassembler finally
+reaches address `16` by walking over instructions one by one, it will try to
+disassemble it and go further... unless it turns out that an instruction at
+offset `14` takes 4 bytes when disassembled and virtually *includes* the
+instruction located at offset `16`. In that case the Node at location `16` is
+moved to location `14` and disassembled there instead. As a result, the
+instruction will cover both locations, `14` and `16`, taking 4 slots in
+AddressSpace in total.
+
+One byte of the binary data being disassembled corresponds to one slot in
+AddressSpace, as you have probably noticed already.
+
+What if an instruction being disassembled is invalid? Well, it is left intact
+and emitted as a `.short 0xXXXX` line, with a label above it if it has
+been referenced from somewhere.
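
The slot bookkeeping described above can be illustrated with a minimal Python sketch. It is not taken from m68k-disasm: the `Node` fields, the `place` and `reference` helpers and the tiny 32-slot table are all hypothetical, chosen only to show one slot per byte and the rounding of an odd branch target down to an even offset. It does not model moving an unhandled Node backwards.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    offset: int           # first byte covered by the Node
    size: int             # bytes covered, i.e. slots occupied
    handled: bool = True  # False for a Node that is referenced but not disassembled yet


class AddressSpace:
    """One slot per byte of the binary; each slot points to the Node covering it."""

    def __init__(self, slot_count: int) -> None:
        self.slots: List[Optional[Node]] = [None] * slot_count

    def place(self, node: Node) -> None:
        # A Node occupies as many consecutive slots as it has bytes.
        for i in range(node.offset, node.offset + node.size):
            self.slots[i] = node

    def reference(self, target: int) -> Node:
        """Return the Node for a branch target, creating an unhandled one if needed."""
        base = target & ~1  # assumption: QuickPeek keeps Nodes on even offsets only
        node = self.slots[base]
        if node is None:
            node = Node(base, 2, handled=False)
            self.place(node)
        return node


space = AddressSpace(slot_count=32)   # small table for the demo only
nop = Node(offset=0, size=2)          # `nop`, 2 bytes
andiw = Node(offset=2, size=4)        # `andiw #0x03ff,%d1`, 4 bytes
space.place(nop)
space.place(andiw)
target = space.reference(17)          # branch to odd offset 17

andiw_slots = [i for i, n in enumerate(space.slots) if n is andiw]
print(andiw_slots)                    # [2, 3, 4, 5]
print(target.offset, target.handled)  # 16 False
```

A reference to offset `17` would then be emitted relative to the Node at `16`, for example as a label plus `1`.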
+
+Hence the rules of QuickPeek mode are as follows:
+
+- Disassembly always starts at offset `0`.
+
+- Disassembly only takes place at addresses aligned to 2, i.e. even offsets.
+
+- References to unaligned locations (odd offsets) are expressed relative to
+  aligned locations only, with some arithmetic.
+
+- No Nodes are placed at unaligned addresses (odd offsets).
+
+- References to forward (not yet handled) locations create unhandled Nodes.
+
+- Unhandled Nodes may be extended backwards and forwards when reached by the
+  disassembly procedure, retaining all references to them. But unhandled Nodes
+  are never shrunk or removed.
+
+## Proper mode
+
+Proper mode is only activated when a trace table has been provided
+explicitly, or if the binary format supports and contains an entry point or more
+information about code and data locations, like functions or just labels. In
+contrast to QuickPeek mode, nothing is assumed unless specified explicitly.
+Everything is considered possible, even unaligned instructions. Hence Nodes may
+be placed even at odd offsets like `17` and occupy an odd number of slots in
+AddressSpace, but not less than 1 slot.
+
+But instead of disassembling everything, Proper mode takes guidance from the
+trace table data. It starts at the lowest address specified as code (PC trace or
+function) in the trace table and disassembles from this point.
+
+If a Node is a PC trace Node, it is disassembled and everything after it is
+skipped until another PC trace Node or function Node is encountered, unless the
+`-fwalk` feature flag is given.
+
+When the `-fwalk` feature flag is specified and a Node is a PC trace Node, the disassembler behaves
+similarly to QuickPeek mode and tries to disassemble instructions one by one
+until it reaches either a) a traced data Node, b) an unconditional control flow
+instruction like `rts`, `jmp`, `bsr` and so on, or c) an invalid instruction.
+
+If a Node is a function Node then...
+well, it does not make much sense to have
+function Nodes at all, because a function spans multiple instructions, but a)
+there can be only one instruction per Node and b) there can be only one Node per slot.
+Functions should probably be in some other relationship, maybe even a half-implicit one, like ELF
+symbols are now.
+
+# Internal stuff
+
+## AddressSpace, Nodes, Regions and Symbols
+
+AddressSpace contains an array of `4 * 1024 * 1024` slots. Each slot is just a
+pointer to a Node. Each Node may point to a single Region if the Node belongs to
+one or more Symbols. A Region is a dynamically allocated array of pointers to
+the Symbols that encompass this Region. A Symbol is one of the following:
+
+- A function;
+
+- A data span like a string, a blob, a table of pointers or an even more complex table.
+
+Imagine it like this. Symbols may intersect, which means that, for example, a function
+may contain a string at its end; the string is nevertheless a part of the function,
+so it is an intersection. A way to represent this intersection is an
+intermediate entity called a Region. Hence, multiple Nodes may share a Region and
+multiple Symbols may share a Region. Basically, a Region that refers to more
+than one Symbol represents a span where these Symbols intersect.
+
+A little visualization of how Regions correspond to Symbols. Each dot represents
+a byte that forms a span with the adjacent dots:
+
+```
+Symbol(fn) | .........................
+Symbol(strz) | .........
+Symbol(str) | ..
+Symbol(strz) | .....
+Symbol(blob) | ......
+Symbol(strz) | .....
+Region | ........
+Region | ..
+Region | ..
+Region | .....
+Region | ........
+Region | ......
+Region | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ..
+Node(data) | ..
+Node(data) | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ......
+Node(data) | .....
+```
+
+This is how Nodes refer to Regions by holding pointers to them:
+
+```
+Symbol(fn) | .........................
+Symbol(strz) | .........
+Symbol(str) | ..
+Symbol(strz) | .....
+Symbol(blob) | ......
+Symbol(strz) | .....
+Region | ........
+Region | ^ ^ ^ ..
+Region | | | | ^ ..
+Region | | | | | ^ .....
+Region | | | | | | ^ ........
+Region | | | | | | | ^ ^ ^ ......
+Region | | | | | | | | | | ^ .....
+Node(pc) | ..| | | | | | | | | ^
+Node(pc) | ....| | | | | | | | |
+Node(pc) | ..| | | | | | | |
+Node(data) | ..| | | | | | |
+Node(data) | ..| | | | | |
+Node(data) | .....| | | | |
+Node(pc) | ..| | | |
+Node(pc) | ....| | |
+Node(pc) | ..| |
+Node(data) | ......|
+Node(data) | .....
+```
+
+This is how Regions refer to Symbols by holding pointers to them:
+
+```
+Symbol(fn) | .........................
+Symbol(strz) | ^ .........^
+Symbol(str) | | ^ .. |
+Symbol(strz) | | | ^ .....|
+Symbol(blob) | | | | ^ | ......
+Symbol(strz) | | | | | | ^ .....
+Region | ........| | | | | ^
+Region | ..| | | | |
+Region | ..| | | |
+Region | .....| | |
+Region | ........| |
+Region | ......|
+Region | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ..
+Node(data) | ..
+Node(data) | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ......
+Node(data) | .....
+```
--
cgit v1.2.3
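
The idea of a Region as a span shared by a fixed set of Symbols, from the "AddressSpace, Nodes, Regions and Symbols" section, can be sketched in Python as follows. This is only an illustration under assumptions: `build_regions`, the `(start, length)` span format and the three-symbol layout are made up here. They are neither the m68k-disasm implementation nor the layout drawn in the diagrams.

```python
from itertools import groupby
from typing import Dict, FrozenSet, List, Tuple

Span = Tuple[int, int]  # (start offset, length in bytes)


def build_regions(symbols: Dict[str, Span], size: int) -> List[Tuple[int, int, FrozenSet[str]]]:
    # For every byte, collect the names of all Symbols that cover it.
    owners = [
        frozenset(name for name, (start, length) in symbols.items()
                  if start <= offset < start + length)
        for offset in range(size)
    ]
    regions = []
    start = 0
    # Adjacent bytes covered by the same set of Symbols form one Region.
    for names, run in groupby(owners):
        length = len(list(run))
        if names:  # bytes that belong to no Symbol get no Region
            regions.append((start, length, names))
        start += length
    return regions


# Hypothetical layout: a function with a string at its end, followed by a blob.
symbols = {"fn": (0, 12), "strz": (8, 4), "blob": (12, 6)}
regions = build_regions(symbols, size=20)
for start, length, names in regions:
    print(start, length, sorted(names))
```

In this layout the middle Region refers to both `fn` and `strz`, which is the intersection case described in the text. Every Node lying inside that span would point to that single Region.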