From 8340b1f42288e0143bca8a254600fb34025ec803 Mon Sep 17 00:00:00 2001
From: Oxore <oxore@protonmail.com>
Date: Sun, 19 Jan 2025 00:36:58 +0300
Subject: WIP doc

---
 doc.md | 273 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)
 create mode 100644 doc.md

diff --git a/doc.md b/doc.md
new file mode 100644
index 0000000..3dbdb11
--- /dev/null
+++ b/doc.md
@@ -0,0 +1,273 @@
+# Disassembly rules
+
+There are several cases and features of the disassembly process that must be
+discussed before they are implemented. This document may also serve as
+documentation once the features have been implemented.
+
+## Unaligned data and instructions
+
+The size of every Motorola 68000 instruction is a multiple of 2 bytes. The
+original M68000 ISA (not 68010 or other) also does not support unaligned
+instruction access, i.e. a jump to an address with the lowest bit set is
+invalid and, as far as I know, ends up in an address error exception on the
+real hardware. Nevertheless, m68k-disasm is about to get support for such
+instructions, because:
+
+- At least GNU AS and Sierra ASM68 support it;
+
+- Data may be unaligned without a problem, hence allowing instructions to be
+  unaligned will yield a consistent implementation of the disassembly algorithm
+  across all the things the disassembler emits.
+
+The only way unaligned instruction execution may happen that I can see now is
+jumping into an unaligned location.
+
+## Jumping into the middle of an instruction
+
+It is unlikely to happen in a real binary, produced by an assembler, but
+assemblers are very capable of producing such code, and even more than that: it
+may work without a problem, since a part of a long instruction may be a valid
+short instruction. A reference to a location for jumping into the middle of an
+instruction may be done with simple arithmetic like this (GNU AS syntax):
+
+```asm
+label:
+        andiw #0x4e71,%d1
+        bras label+2
+```
+
+Which may be disassembled just fine:
+
+```asm
+L00000000:
+        andiw #20081,%d1 | 0241 4e71 @00000000
+        bras L00000000+2 | 60fc @00000004
+```
+
+But this disassembly does not consider the `nop` instruction hidden inside the
+`andiw` instruction. Here is another disassembly of the same code, just to
+show the `nop` instruction, obtained using a trace table with PC trace
+entries on addresses `0`, `2` and `4`:
+
+```asm
+        .short 0x0241 | 0241 @00000000
+L00000002:
+        nop | 4e71 @00000002
+        bras L00000002 | 60fc @00000004
+```
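+
+The hidden instruction can also be checked by hand. Below is a minimal Python
+sketch, not part of m68k-disasm: the helper names `word_at` and `bras_target`
+are made up for illustration. It reads big-endian words from the six bytes
+shown in the listings above and computes the target of the short branch,
+assuming the usual M68000 rule that the 8-bit displacement is relative to the
+address of the branch plus 2:
+
+```python
+# Bytes taken from the listing above: andiw #0x4e71,%d1 ; bras label+2
+CODE = bytes.fromhex("02414e7160fc")
+
+NOP = 0x4E71
+
+
+def word_at(data: bytes, offset: int) -> int:
+    """Read a big-endian 16-bit word, the way the M68000 fetches opcodes."""
+    return int.from_bytes(data[offset:offset + 2], "big")
+
+
+def bras_target(data: bytes, offset: int) -> int:
+    """Return the target offset of a short `bra` located at `offset`.
+
+    Displacement byte 0x00 (and 0xff on 68020+) selects a longer form of
+    the branch; those forms are deliberately not handled in this sketch.
+    """
+    opcode, disp = data[offset], data[offset + 1]
+    assert opcode == 0x60, "not a short bra"
+    if disp >= 0x80:  # sign-extend the 8-bit displacement
+        disp -= 0x100
+    return offset + 2 + disp
+
+
+target = bras_target(CODE, 4)
+print(f"branch target: {target}, word there: {word_at(CODE, target):04x}")
+```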
+
+## Unaligned jump into the middle of an instruction
+
+But what about an unaligned jump?
+
+```asm
+label:
+        andiw #0x4e71,%d1
+        bras label+3
+```
+
+The listing is valid and the assembler will emit exactly what it represents,
+but now it does not make any sense, since the word at offset `3`, `7160`, is
+not a valid instruction. If it was `7060`, it would be valid, because `7060` is
+`moveq #0x60,%d0`. Even then it is not guaranteed to work on the real hardware,
+so it could only be an attribute for some heuristics.
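+
+The validity check on that unaligned word can be sketched in Python. This is
+an illustration only, not code from m68k-disasm: `decode_moveq` is a
+hypothetical helper, and the byte string assumes the assembler encodes
+`bras label+3` with displacement `-3` (`60fd`). The `moveq` encoding is
+`0111 rrr 0 dddddddd`, so the top nibble must be `7` and bit 8 must be clear:
+
+```python
+def decode_moveq(word: int):
+    """Return (immediate, data register) if `word` is a moveq, else None."""
+    if (word & 0xF100) != 0x7000:  # top nibble 0111, bit 8 must be 0
+        return None
+    imm = word & 0xFF
+    if imm >= 0x80:  # the 8-bit immediate is sign-extended
+        imm -= 0x100
+    return imm, (word >> 9) & 7
+
+
+# Assumed encoding of: andiw #0x4e71,%d1 ; bras label+3
+CODE = bytes.fromhex("02414e7160fd")
+
+# The word an unaligned jump to offset 3 would land on.
+odd_word = int.from_bytes(CODE[3:5], "big")
+
+print(f"{odd_word:04x}", decode_moveq(odd_word))
+print("7060", decode_moveq(0x7060))
+```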
+
+On the other hand, the approach of m68k-disasm is not to try to be a
+smarty-pants about everything it handles, at least by default. The priority is
+producing a listing that an assembler will translate into a binary file that is
+100% identical to the initial binary being disassembled, no matter what.
+
+... TODO ...
+
+
+## Two modes of operation (QuickPeek mode)
+
+There are basically two modes of operation in m68k-disasm disassembler:
+**QuickPeek** mode and **Proper** mode.
+
+In QuickPeek mode there are not many rules. It starts at offset `0` of the file,
+takes 2 bytes and tries to interpret them as an instruction. If an instruction
+takes more than two bytes, all other bytes are taken into account. A Node is
+created that occupies as many slots as it needs in AddressSpace. Node and
+AddressSpace are internal structures. AddressSpace is just an array of
+pointers to Nodes and it can fit up to 4Mi pointers in it. Let's say there is a
+`nop` (`4e71`) instruction at offset `0`. A single Node is created and placed at
+slots `0` and `1` in AddressSpace. Next, let's say an `andiw #0x03ff,%d1`
+(`024103ff`) instruction is encountered, which takes 4 bytes, so its Node is
+created that occupies 4 slots in AddressSpace at offsets `2`, `3`, `4` and `5`.
+And so on, and so on.
+
+If an instruction references a memory location as code, i.e. it jumps/branches
+to a location not yet handled by the disassembler, then a Node is created at
+the location, but it preserves 2-byte alignment. For example, an encounter of
+an instruction branching to offset `17` will create a Node of size 2 at offset
+`16`, if it does not exist yet, and it won't be disassembled right away. It
+will be marked as *not handled* in some way. When the disassembler finally
+reaches address `16` by walking over instructions one by one, it will try to
+disassemble it and go further... unless it turns out that an instruction at
+offset `14` takes 4 bytes when disassembled and virtually *includes* the
+instruction located at offset `16`. In that case the Node at location `16` is
+moved to location `14` and disassembled there instead. As a result, the
+instruction covers both locations, `14` and `16`, taking 4 slots in
+AddressSpace in total.
+
+One byte of the binary data being disassembled corresponds to one slot in
+AddressSpace, as you have probably noticed already.
+
+What if an instruction being disassembled is invalid? Well, it is left intact
+then and emitted as a `.short 0xXXXX` line, with a label above it if it has
+been referenced from somewhere.
+
+Hence the rules of the QuickPeek mode are as follows:
+
+- Disassembly always starts at offset `0`.
+
+- Disassembly only takes place at addresses aligned to 2, i.e. even offsets.
+
+- References to unaligned locations (odd offsets) are expressed relative to
+  aligned locations only, with some arithmetic.
+
+- No Nodes are placed at unaligned addresses (odd offsets).
+
+- Referencing a forward (not yet handled) location creates an unhandled Node.
+
+- Unhandled Nodes may be extended backwards and forwards when reached by the
+  disassembly procedure, retaining all references to them. But unhandled Nodes
+  are never shrunk or removed.
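+
+A toy model of these rules, written in Python purely for illustration. The
+class below is an assumption and not the real AddressSpace (which holds 4Mi
+Node pointers): each slot only remembers the start offset of the Node that
+owns it. The label format `L%08x` is borrowed from the listings earlier in
+this document and is assumed to be hexadecimal.
+
+```python
+class AddressSpace:
+    """One slot per byte; a slot stores the start offset of its Node."""
+
+    def __init__(self, size: int):
+        self.owner = [None] * size
+
+    def place(self, start: int, size: int) -> None:
+        """Place (or extend) a Node; QuickPeek allows even offsets only."""
+        assert start % 2 == 0, "no Nodes at odd offsets in QuickPeek mode"
+        for slot in range(start, start + size):
+            self.owner[slot] = start
+
+    def reference(self, offset: int) -> str:
+        """Express a reference to `offset` relative to an aligned Node."""
+        aligned = offset & ~1
+        if self.owner[aligned] is None:
+            self.place(aligned, 2)  # unhandled 2-byte Node, not disassembled
+        start = self.owner[offset]
+        label = f"L{start:08x}"
+        return label if start == offset else f"{label}+{offset - start}"
+
+
+space = AddressSpace(32)
+print(space.reference(17))  # a branch to 17 creates an unhandled Node at 16
+space.place(14, 4)          # a 4-byte instruction at 14 absorbs that Node
+print(space.reference(17))  # the same reference, now relative to offset 14
+```
+
+In this sketch the reference to offset `17` survives the Node being extended
+backwards; only the label it is expressed against changes.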
+
+## Proper mode
+
+The Proper mode is only activated when a trace table has been provided
+explicitly or if the binary format supports and contains an entry point or more
+information about code and data locations, like functions or just labels. In
+contrast to QuickPeek mode, nothing is assumed if not specified explicitly.
+Everything is considered possible, even unaligned instructions. Hence Nodes may
+be placed even on odd offsets like `17` and occupy an odd number of slots in
+AddressSpace, but not less than 1 slot.
+
+But instead of disassembling everything, Proper mode takes guidance from the
+trace table data. It starts at the lowest address specified as code (PC trace or
+function) in the trace table and disassembles from this point.
+
+If a Node is a PC trace Node, it is disassembled and everything after it is
+skipped until another PC trace Node or function Node is encountered, unless
+the `-fwalk` feature flag is given.
+
+When the `-fwalk` feature flag is specified and a Node is a PC trace Node, the
+disassembler acts much like QuickPeek mode and disassembles instructions one by
+one until it reaches a) a traced data Node, b) an unconditional control flow
+instruction like `rts`, `jmp`, `bsr` and so on, or c) an invalid instruction.
+
+If a Node is a function Node then... well, it does not make much sense to have
+function Nodes at all, because a function spans multiple instructions, but a)
+there can be only one instruction per Node and b) only one Node per slot.
+A function should probably be in some other relationship, maybe even
+half-implicit, like ELF symbols are now.
+
+# Internal stuff
+
+## AddressSpace, Nodes, Regions and Symbols
+
+AddressSpace contains an array of `4 * 1024 * 1024` slots. Each slot is just a
+pointer to a Node. Each Node may point to a single Region if the Node belongs to
+one or more Symbols. A Region is a dynamically allocated array of pointers to
+the Symbols that encompass this Region. A Symbol is one of the following:
+
+- A function;
+
+- A data span like string, blob, table of pointers or even more complex table.
+
+Imagine it like this. Symbols may intersect, which means, for example, that a
+function may contain a string at its end, but the string is nevertheless a part
+of the function, so it is an intersection. A way to represent this intersection
+is an intermediate entity called Region. Hence, multiple Nodes may share a
+Region and multiple Symbols may share a Region. Basically, a Region that refers
+to more than one Symbol represents a span of intersection of these Symbols.
+
+A little visualization of how Regions correspond to Symbols. Each dot represents
+a byte, and adjacent dots form a span:
+
+```
+Symbol(fn)   | .........................
+Symbol(strz) |         .........
+Symbol(str)  |           ..
+Symbol(strz) |             .....
+Symbol(blob) |                          ......
+Symbol(strz) |                                .....
+Region       | ........
+Region       |         ..
+Region       |           ..
+Region       |             .....
+Region       |                  ........
+Region       |                          ......
+Region       |                                .....
+Node(pc)     | ..
+Node(pc)     |   ....
+Node(pc)     |       ..
+Node(data)   |         ..
+Node(data)   |           ..
+Node(data)   |             .....
+Node(pc)     |                  ..
+Node(pc)     |                    ....
+Node(pc)     |                        ..
+Node(data)   |                          ......
+Node(data)   |                                .....
+```
+
+This is how Nodes refer to Regions by holding pointers to them:
+
+```
+Symbol(fn)   | .........................
+Symbol(strz) |         .........
+Symbol(str)  |           ..
+Symbol(strz) |             .....
+Symbol(blob) |                          ......
+Symbol(strz) |                                .....
+Region       | ........
+Region       | ^ ^   ^ ..
+Region       | | |   | ^ ..
+Region       | | |   | | ^ .....
+Region       | | |   | | | ^    ........
+Region       | | |   | | | |    ^ ^   ^ ......
+Region       | | |   | | | |    | |   | ^     .....
+Node(pc)     | ..|   | | | |    | |   | |     ^
+Node(pc)     |   ....| | | |    | |   | |     |
+Node(pc)     |       ..| | |    | |   | |     |
+Node(data)   |         ..| |    | |   | |     |
+Node(data)   |           ..|    | |   | |     |
+Node(data)   |             .....| |   | |     |
+Node(pc)     |                  ..|   | |     |
+Node(pc)     |                    ....| |     |
+Node(pc)     |                        ..|     |
+Node(data)   |                          ......|
+Node(data)   |                                .....
+```
+
+This is how Regions refer to Symbols by holding pointers to them:
+
+```
+Symbol(fn)   | .........................
+Symbol(strz) | ^       .........^
+Symbol(str)  | |       ^ ..     |
+Symbol(strz) | |       | ^ .....|
+Symbol(blob) | |       | | ^    |       ......
+Symbol(strz) | |       | | |    |       ^     .....
+Region       | ........| | |    |       |     ^
+Region       |         ..| |    |       |     |
+Region       |           ..|    |       |     |
+Region       |             .....|       |     |
+Region       |                  ........|     |
+Region       |                          ......|
+Region       |                                .....
+Node(pc)     | ..
+Node(pc)     |   ....
+Node(pc)     |       ..
+Node(data)   |         ..
+Node(data)   |           ..
+Node(data)   |             .....
+Node(pc)     |                  ..
+Node(pc)     |                    ....
+Node(pc)     |                        ..
+Node(data)   |                          ......
+Node(data)   |                                .....
+```
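+
+The Region layout drawn above can be derived mechanically from the Symbol
+spans. The following Python sketch is only one possible way to do it and is
+not taken from m68k-disasm: the `Symbol` class and `split_into_regions` are
+hypothetical names, and the spans are read off the diagrams as half-open byte
+ranges. Every Symbol boundary starts a new Region, and each Region lists the
+Symbols that fully cover it:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Symbol:
+    kind: str
+    start: int  # first byte of the span
+    end: int    # one past the last byte of the span
+
+
+def split_into_regions(symbols):
+    """Cut the covered bytes into maximal spans with a constant Symbol set."""
+    bounds = sorted({b for s in symbols for b in (s.start, s.end)})
+    regions = []
+    for lo, hi in zip(bounds, bounds[1:]):
+        members = [s for s in symbols if s.start <= lo and hi <= s.end]
+        if members:  # bytes that belong to no Symbol get no Region
+            regions.append((lo, hi, members))
+    return regions
+
+
+# Spans as drawn in the diagrams above.
+SYMBOLS = [
+    Symbol("fn", 0, 25), Symbol("strz", 8, 17), Symbol("str", 10, 12),
+    Symbol("strz", 12, 17), Symbol("blob", 25, 31), Symbol("strz", 31, 36),
+]
+
+for lo, hi, members in split_into_regions(SYMBOLS):
+    print(f"[{lo:2}, {hi:2}) -> {[s.kind for s in members]}")
+```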
-- 
cgit v1.2.3