author Oxore <oxore@protonmail.com> 2025-01-19 00:36:58 +0300
committer Oxore <oxore@protonmail.com> 2025-02-01 18:26:18 +0300
commit 8340b1f42288e0143bca8a254600fb34025ec803
tree a7ff2b38198c01eff7ae11f49d55c82f1be86f6f
parent bea4c5538e287cd3b5943c1e45e8b24c5b462cb4
WIP doc
-rw-r--r-- doc.md 273
1 files changed, 273 insertions, 0 deletions
diff --git a/doc.md b/doc.md
new file mode 100644
index 0000000..3dbdb11
--- /dev/null
+++ b/doc.md
@@ -0,0 +1,273 @@
+# Disassembly rules
+
+There are several cases and features of the disassembly process that must be
+discussed before they are implemented. This document may also serve as
+documentation after the features have been implemented.
+
+## Unaligned data and instructions
+
+The sizes of all Motorola 68000 instructions are multiples of 2 bytes. The
+original M68000 ISA (not 68010 or other) also does not support unaligned
+instruction access, i.e. jumps to an address with the lowest bit set are
+invalid and raise an address error exception on the real hardware.
+Nevertheless, m68k-disasm is about to get support for such instructions,
+because:
+
+- At least GNU AS and Sierra ASM68 support it;
+
+- Data may be unaligned without a problem, hence allowing instructions to be
+ unaligned will yield a consistent implementation of the disassembly algorithm
+ across all the things the disassembler emits.
+
+The only way I can currently see for unaligned instruction execution to happen
+is a jump to an unaligned location.
+
+## Jumping into the middle of an instruction
+
+It is unlikely to happen in a real binary produced by an assembler, but
+assemblers are perfectly capable of producing such code. Moreover, it may work
+without a problem, since a part of a long instruction may be a valid short
+instruction. A reference to a location in the middle of an instruction may be
+made with simple arithmetic like this (GNU AS syntax):
+
+```asm
+label:
+ andiw #0x4e71,%d1
+ bras label+2
+```
+
+Which may be disassembled just fine:
+
+```asm
+L00000000:
+ andiw #20081,%d1 | 0241 4e71 @00000000
+ bras L00000000+2 | 60fc @00000004
+```
+
+But this disassembly does not reveal the `nop` instruction hidden inside the
+`andiw` instruction. Here is another disassembly of the same code that shows
+the `nop` instruction, obtained using a trace table with PC trace entries at
+addresses `0`, `2` and `4`:
+
+```asm
+ .short 0x0241 | 0241 @00000000
+L00000002:
+ nop | 4e71 @00000002
+ bras L00000002 | 60fc @00000004
+```
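The `60fc` encoding in both listings can be checked by hand. Below is a minimal Python sketch, not taken from m68k-disasm, that assumes only the short form of `bra`: opcode byte `0x60` followed by an 8-bit signed displacement counted from the instruction address plus 2. The function name `bra_s_target` is hypothetical.

```python
def bra_s_target(addr: int, opcode: int) -> int:
    """Return the target of a short `bra` located at `addr`.

    Assumes the short form only: high byte 0x60, low byte is a signed
    8-bit displacement counted from `addr + 2`.
    """
    if opcode >> 8 != 0x60:
        raise ValueError("not a bra instruction")
    disp = opcode & 0xFF
    if disp in (0x00, 0xFF):
        # 0x00 selects a 16-bit displacement (0xFF a 32-bit one on 68020+);
        # both need extension words and are out of scope for this sketch.
        raise ValueError("not a short displacement")
    if disp >= 0x80:
        disp -= 0x100  # sign-extend the 8-bit displacement
    return addr + 2 + disp


# The branch at address 4 from the listings above.
print(bra_s_target(4, 0x60FC))  # 2, i.e. `label+2`
```

The same bytes therefore describe both `bras L00000000+2` and `bras L00000002`; only the choice of label differs between the two listings.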
+
+## Unaligned jump into the middle of an instruction
+
+But what about an unaligned jump?
+
+```asm
+label:
+ andiw #0x4e71,%d1
+ bras label+3
+```
+
+The listing is valid and the assembler will emit exactly what it represents,
+but now it does not make any sense, since `7160` is not a valid instruction. If
+it were `7060`, it would be valid, because `7060` is `moveq #0x60,%d0`. Even
+then it is not guaranteed to work on the real hardware, so alignment could be
+an attribute for some heuristics.
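Why `7160` is rejected while `7060` is accepted follows from the `moveq` encoding, `0111 rrr0 dddddddd`, where bit 8 must be zero. A minimal Python sketch of that check is below. The helper `decode_moveq` is hypothetical and not part of m68k-disasm, and the `60fd` branch encoding is derived by hand on the assumption that the assembler emits the short form of `bras label+3`.

```python
def decode_moveq(word: int):
    """Decode a 16-bit word as `moveq`, or return None if it is not one.

    `moveq` is encoded as 0111 rrr0 dddddddd: bits 15-12 are 0111, bit 8 is
    zero, `rrr` is the data register and `dddddddd` is a signed immediate.
    """
    if word & 0xF100 != 0x7000:
        return None
    reg = (word >> 9) & 7
    imm = word & 0xFF
    if imm >= 0x80:
        imm -= 0x100  # sign-extend the 8-bit immediate
    return f"moveq #{imm},%d{reg}"


# Assumed image of: andiw #0x4e71,%d1 ; bras label+3
image = bytes.fromhex("02414e7160fd")
word = int.from_bytes(image[3:5], "big")  # the word the branch lands on
print(hex(word), decode_moveq(word))      # 0x7160 None
print(decode_moveq(0x7060))               # moveq #96,%d0
```

The immediate is printed in decimal here, so `0x60` shows up as `96`.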
+
+On the other hand, the approach of m68k-disasm is not to try to be a
+smarty-pants about everything it handles, at least by default. The priority is
+producing a listing that an assembler will translate into a binary file that is
+100% identical to the initial binary being disassembled, no matter what.
+
+... TODO ...
+
+
+## Two modes of operation (QuickPeek mode)
+
+There are basically two modes of operation in m68k-disasm disassembler:
+**QuickPeek** mode and **Proper** mode.
+
+In QuickPeek mode there are not many rules. It starts at offset `0` of the
+file, takes 2 bytes and tries to interpret them as an instruction. If an
+instruction takes more than two bytes, all its other bytes are taken into
+account. A Node is created that occupies as many slots as it needs in
+AddressSpace. Node and AddressSpace are internal structures. AddressSpace is
+just an array of pointers to Nodes and it can fit up to 4Mi pointers. Let's say
+there is a `nop` (`4e71`) instruction at offset `0`. A single Node is created
+and placed at slots `0` and `1` in AddressSpace. Next, let's say an
+`andiw #0x03ff,%d1` (`024103ff`) instruction is encountered, which takes 4
+bytes, so its Node is created and occupies 4 slots in AddressSpace at offsets
+`2`, `3`, `4` and `5`. And so on, and so on.
+
+If an instruction references a memory location as code, i.e. it jumps or
+branches to a location not yet handled by the disassembler, then a Node is
+created at that location, rounded down to 2-byte alignment. For example, an
+instruction branching to offset `17` will create a Node of size 2 at offset
+`16`, if one does not exist yet, and it won't be disassembled right away. It
+will be marked as *not handled* in some way. When the disassembler finally
+reaches address `16` by walking over instructions one by one, it will try to
+disassemble it and go further, unless it turns out that an instruction at
+offset `14` takes 4 bytes when disassembled and virtually *includes* the
+instruction located at offset `16`. In that case the Node at location `16` is
+moved to location `14` and disassembled there instead. As a result, the
+instruction covers both locations, `14` and `16`, taking 4 slots in
+AddressSpace in total.
+
+One byte of the binary data being disassembled corresponds to one slot in
+AddressSpace, as you have probably noticed already.
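The slot bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Node` and `AddressSpace` come from the text, while the fields `offset`, `size`, `text` and the method `place` are hypothetical and do not describe the real implementation.

```python
class Node:
    """One disassembled item; `offset`, `size` and `text` are hypothetical."""

    def __init__(self, offset: int, size: int, text: str):
        self.offset = offset
        self.size = size
        self.text = text


class AddressSpace:
    SIZE = 4 * 1024 * 1024  # 4Mi slots, one slot per byte of the binary

    def __init__(self):
        self.slots = [None] * self.SIZE  # each slot points to a Node or nothing

    def place(self, node: Node) -> None:
        """Make every slot covered by `node` point to it."""
        for i in range(node.offset, node.offset + node.size):
            if self.slots[i] is not None:
                raise ValueError(f"slot {i} is already occupied")
            self.slots[i] = node


space = AddressSpace()
space.place(Node(0, 2, "nop"))                # a 2-byte instruction
space.place(Node(2, 4, "andiw #0x03ff,%d1"))  # a 4-byte instruction
occupied = [i for i in range(8) if space.slots[i] is not None]
print(occupied)  # [0, 1, 2, 3, 4, 5]
```

Because every covered slot points to the same Node, looking up any byte offset finds the instruction that contains it.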
+
+What if an instruction being disassembled is invalid? Well, then it is left
+intact and emitted as a `.short 0xXXXX` line, with a label above it if it has
+been referenced from somewhere.
+
+Hence the rules of QuickPeek mode are:
+
+- Disassembly always starts at offset `0`.
+
+- Disassembly only takes place at addresses aligned to 2, i.e. even offsets.
+
+- References to unaligned locations (odd offsets) are expressed relative to
+ aligned locations only, with some arithmetic.
+
+- No Nodes are placed at unaligned addresses (odd offsets).
+
+- Referencing forward (not yet handled) locations creates unhandled Nodes.
+
+- Unhandled Nodes may be extended backwards and forwards when reached by the
+ disassembly procedure, retaining all references to them. But unhandled Nodes
+ are never shrunk or removed.
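The rule about expressing odd offsets relative to aligned locations can be sketched as follows. This is a hedged illustration: the helper names `align_down` and `reference` are hypothetical, and the label format (`L` followed by 8 lowercase hex digits) is an assumption extrapolated from labels such as `L00000002` in the listings above.

```python
def align_down(offset: int) -> int:
    """Round an offset down to the nearest even offset."""
    return offset & ~1


def reference(target: int) -> str:
    """Express a reference to `target` relative to an aligned label."""
    base = align_down(target)
    label = f"L{base:08x}"  # assumed label format
    return label if base == target else f"{label}+{target - base}"


print(reference(16))  # L00000010
print(reference(17))  # L00000010+1
```

A branch to offset `17` thus needs a Node only at offset `16`, and the odd byte is carried by the `+1` in the operand.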
+
+## Proper mode
+
+Proper mode is only activated when a trace table has been provided explicitly
+or if the binary format supports and contains an entry point or more
+information about code and data locations, like functions or just labels. In
+contrast to QuickPeek mode, nothing is assumed unless specified explicitly.
+Everything is considered possible, even unaligned instructions. Hence Nodes may
+be placed even at odd offsets like `17` and occupy an odd number of slots in
+AddressSpace, but not less than 1 slot.
+
+But instead of disassembling everything, Proper mode takes guidance from the
+trace table data. It starts at the lowest address specified as code (PC trace or
+function) in the trace table and disassembles from this point.
+
+If a Node is a PC trace Node, it is disassembled and everything after it is
+skipped until another PC trace Node or function Node is encountered, unless the
+`-fwalk` feature flag is given.
+
+When the `-fwalk` feature flag is specified and a Node is a PC trace Node, the
+disassembler behaves similarly to QuickPeek mode and tries to disassemble
+instructions one by one until it reaches either a) a traced data Node, b) an
+unconditional control flow instruction like `rts`, `jmp`, `bsr` and so on, or
+c) an invalid instruction.
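The walk with its three stop conditions can be sketched like this. It is a toy under loud assumptions: `toy_decode` is a stand-in decoder that knows only `nop`, `rts` and `andiw #imm,%dN`, the names `walk` and `data_offsets` are hypothetical, and data is only checked at instruction starts. None of it reflects the real `-fwalk` implementation.

```python
def toy_decode(code: bytes, offset: int):
    """Return (text, size, stops_walk), or None for an invalid instruction.

    A stand-in decoder that knows only three instructions.
    """
    if offset + 2 > len(code):
        return None
    word = int.from_bytes(code[offset:offset + 2], "big")
    if word == 0x4E71:
        return ("nop", 2, False)
    if word == 0x4E75:
        return ("rts", 2, True)  # unconditional control flow
    if word & 0xFFF8 == 0x0240 and offset + 4 <= len(code):
        imm = int.from_bytes(code[offset + 2:offset + 4], "big")
        return (f"andiw #{imm},%d{word & 7}", 4, False)
    return None


def walk(code: bytes, start: int, data_offsets=frozenset()):
    """Disassemble from `start` until one of the stop conditions is met."""
    listing = []
    offset = start
    while offset < len(code):
        if offset in data_offsets:      # a) traced data Node
            break
        decoded = toy_decode(code, offset)
        if decoded is None:             # c) invalid instruction
            break
        text, size, stops_walk = decoded
        listing.append((offset, text))
        offset += size
        if stops_walk:                  # b) unconditional control flow
            break
    return listing


code = bytes.fromhex("4e71" "024103ff" "4e75" "4e71")
print(walk(code, 0))
# [(0, 'nop'), (2, 'andiw #1023,%d1'), (6, 'rts')]
```

The trailing `nop` at offset `8` is not emitted, because the walk stops at `rts`.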
+
+If a Node is a function Node then... well, it does not make much sense to have
+function Nodes at all, because a function spans multiple instructions, while
+a) there can be only one instruction per Node and b) there can be only one Node
+per slot. Functions should probably be in some other relationship, maybe even
+half-implicit, like ELF symbols are now.
+
+# Internal stuff
+
+## AddressSpace, Nodes, Regions and Symbols
+
+AddressSpace contains an array of `4 * 1024 * 1024` slots. Each slot is just a
+pointer to a Node. Each Node may point to a single Region if the Node belongs to
+one or more Symbols. A Region is a dynamically allocated array of pointers to
+the Symbols that encompass this Region. A Symbol is one of the following:
+
+- A function;
+
+- A data span like string, blob, table of pointers or even more complex table.
+
+Imagine it like this. Symbols may intersect, which means, for example, that a
+function may contain a string at its end, and the string is nevertheless a part
+of the function, so it is an intersection. A way to represent this intersection
+is an intermediate entity called Region. Hence, multiple Nodes may share a
+Region and multiple Symbols may share a Region. Basically, a Region that refers
+to more than one Symbol represents a span of intersection of these Symbols.
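One way to picture how Regions fall out of intersecting Symbols is to cut the address range at every Symbol boundary. The Python sketch below does exactly that. It is an illustration, not the algorithm used by m68k-disasm: the function `split_into_regions`, the half-open `(start, end)` spans and the example layout are all hypothetical, and symbol names stand in for the pointers a real Region would hold.

```python
def split_into_regions(symbols):
    """Split overlapping symbol spans into Regions.

    `symbols` maps a symbol name to a half-open byte span (start, end). Each
    returned Region is (start, end, names of the symbols covering it).
    """
    bounds = sorted({edge for span in symbols.values() for edge in span})
    regions = []
    for start, end in zip(bounds, bounds[1:]):
        names = sorted(
            name for name, (lo, hi) in symbols.items() if lo <= start and end <= hi
        )
        if names:  # skip gaps that no symbol covers
            regions.append((start, end, names))
    return regions


# Hypothetical layout: a 12-byte function whose last 4 bytes are a string.
symbols = {"fn": (0, 12), "str": (8, 12)}
for region in split_into_regions(symbols):
    print(region)
# (0, 8, ['fn'])
# (8, 12, ['fn', 'str'])
```

The second Region refers to two Symbols, which is the span of intersection described above.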
+
+A little visualization of how Regions correspond to Symbols. Each dot
+represents a byte, and adjacent dots form a span:
+
+```
+Symbol(fn)   | .........................
+Symbol(strz) |         .........
+Symbol(str)  |           ..
+Symbol(strz) |             .....
+Symbol(blob) |                          ......
+Symbol(strz) |                                .....
+Region       | ........
+Region       |         ..
+Region       |           ..
+Region       |             .....
+Region       |                  ........
+Region       |                          ......
+Region       |                                .....
+Node(pc)     | ..
+Node(pc)     |   ....
+Node(pc)     |       ..
+Node(data)   |         ..
+Node(data)   |           ..
+Node(data)   |             .....
+Node(pc)     |                  ..
+Node(pc)     |                    ....
+Node(pc)     |                        ..
+Node(data)   |                          ......
+Node(data)   |                                .....
+```
+
+This is how Nodes refer to Regions by holding pointers to them:
+
+```
+Symbol(fn) | .........................
+Symbol(strz) | .........
+Symbol(str) | ..
+Symbol(strz) | .....
+Symbol(blob) | ......
+Symbol(strz) | .....
+Region | ........
+Region | ^ ^ ^ ..
+Region | | | | ^ ..
+Region | | | | | ^ .....
+Region | | | | | | ^ ........
+Region | | | | | | | ^ ^ ^ ......
+Region | | | | | | | | | | ^ .....
+Node(pc) | ..| | | | | | | | | ^
+Node(pc) | ....| | | | | | | | |
+Node(pc) | ..| | | | | | | |
+Node(data) | ..| | | | | | |
+Node(data) | ..| | | | | |
+Node(data) | .....| | | | |
+Node(pc) | ..| | | |
+Node(pc) | ....| | |
+Node(pc) | ..| |
+Node(data) | ......|
+Node(data) | .....
+```
+
+This is how Regions refer to Symbols by holding pointers to them.
+
+```
+Symbol(fn) | .........................
+Symbol(strz) | ^ .........^
+Symbol(str) | | ^ .. |
+Symbol(strz) | | | ^ .....|
+Symbol(blob) | | | | ^ | ......
+Symbol(strz) | | | | | | ^ .....
+Region | ........| | | | | ^
+Region | ..| | | | |
+Region | ..| | | |
+Region | .....| | |
+Region | ........| |
+Region | ......|
+Region | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ..
+Node(data) | ..
+Node(data) | .....
+Node(pc) | ..
+Node(pc) | ....
+Node(pc) | ..
+Node(data) | ......
+Node(data) | .....
+```