# Disassembly rules

There are several cases and features of the disassembly process that must be
discussed before they are implemented. This document may also serve as
documentation after the features have been implemented.

## Unaligned data and instructions

The sizes of all Motorola 68000 ISA instructions are multiples of 2. The
original M68000 ISA (not 68010 or other) also does not support unaligned
instruction access, i.e. jumps to an address with the lowest bit set are invalid
and raise an address error exception on the real hardware. Nevertheless,
m68k-disasm is about to get support for such instructions, because:

- At least GNU AS and Sierra ASM68 support it;

- Data may be unaligned without a problem, hence allowing instructions to be
  unaligned will yield a consistent implementation of the disassembly algorithm
  across all the things the disassembler emits.

The only way unaligned instruction execution may happen that I can see now is
jumping into an unaligned location.

## Jumping into the middle of an instruction

It is unlikely to happen in a real binary produced by an assembler, but
assemblers are very capable of producing such code, and even more than that: it
may work without a problem, since a part of a long instruction may be a valid
short instruction. A reference to a location in the middle of an instruction may
be made with simple arithmetic like this (GNU AS syntax):

```asm
label:
        andiw #0x4e71,%d1
        bras label+2
```

Which may be disassembled just fine:

```asm
L00000000:
        andiw #20081,%d1 | 0241 4e71 @00000000
        bras L00000000+2 | 60fc @00000004
```

But this disassembly does not consider the `nop` instruction hidden inside the
`andiw` instruction. Here is another disassembly of the same code, just to show
the `nop` instruction, obtained using a trace table with PC trace entries at
addresses `0`, `2` and `4`:

```asm
        .short 0x0241 | 0241 @00000000
L00000002:
        nop | 4e71 @00000002
        bras L00000002 | 60fc @00000004
```
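
The arithmetic behind `bras label+2` can be checked by hand. Below is a minimal
Python sketch, not part of m68k-disasm, that decodes the short branch from the
bytes shown above and looks up the word it lands on. The helper name
`bra_s_target` is made up for this illustration.

```python
def bra_s_target(opcode: int, address: int) -> int:
    """Return the target of a short `bra` whose opcode word sits at `address`."""
    if opcode >> 8 != 0x60:
        raise ValueError("not a bra opcode")
    disp = opcode & 0xFF
    if disp in (0x00, 0xFF):
        raise ValueError("displacement is stored in extension words")
    if disp >= 0x80:
        disp -= 0x100  # sign-extend the 8-bit displacement
    # The displacement is relative to the address right after the opcode word.
    return address + 2 + disp


code = bytes.fromhex("02414e7160fc")  # andiw #0x4e71,%d1 ; bras label+2
words = [int.from_bytes(code[i:i + 2], "big") for i in range(0, len(code), 2)]
target = bra_s_target(words[2], address=4)
hidden = words[target // 2]
print(f"branch target: {target}, word there: {hidden:04x}")
```

The branch lands on offset `2`, where the immediate operand `4e71` of `andiw`
happens to be the encoding of `nop`.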

## Unaligned jump into the middle of an instruction

But what about an unaligned jump?

```asm
label:
        andiw #0x4e71,%d1
        bras label+3
```

The listing is valid and the assembler will emit exactly what it represents, but
now it does not make any sense, since `7160` is not a valid instruction. If it
were `7060`, it would be valid, because `7060` is `moveq #0x60,%d0`. It is still
not guaranteed to work on the real hardware, though, so it could be an attribute
for some heuristics.
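
To make the `7160` versus `7060` point concrete, here is a small Python sketch
that tests a word against the `moveq` encoding (top nibble `0x7`, bit 8 clear).
It assumes the listing above assembles to `0241 4e71 60fd`, i.e. a short branch
with displacement `-3`. The function `decode_moveq` is hypothetical and only
covers this one instruction.

```python
def decode_moveq(word):
    """Return GNU AS text for a `moveq` opcode word, or None if it is not one."""
    # moveq is encoded as 0111 rrr 0 dddddddd.
    if (word & 0xF100) != 0x7000:
        return None
    reg = (word >> 9) & 7
    data = word & 0xFF
    if data >= 0x80:
        data -= 0x100  # the 8-bit immediate is sign-extended
    return f"moveq #{data},%d{reg}"


code = bytes.fromhex("02414e7160fd")          # assumed encoding, see above
odd_word = int.from_bytes(code[3:5], "big")   # the word at odd offset 3
print(f"{odd_word:04x}", decode_moveq(odd_word))
print("7060", decode_moveq(0x7060))
```

The word at offset `3` has bit 8 set, so it is rejected, while `7060` decodes
with the immediate `0x60`, which is 96 in decimal.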

On the other hand, the approach of m68k-disasm is to not try to be a
smarty-pants about everything it handles, at least by default. The priority is
producing a listing that an assembler will translate into a binary file that is
100% identical to the initial binary being disassembled, no matter what.

... TODO ...


## Two modes of operation (QuickPeek mode)

There are basically two modes of operation in the m68k-disasm disassembler:
**QuickPeek** mode and **Proper** mode.

In QuickPeek mode there are not many rules. It starts at offset `0` of the file,
takes 2 bytes and tries to interpret them as an instruction. If an instruction
takes more than two bytes, all of its remaining bytes are taken into account. A
Node is created that occupies as many slots as it needs in AddressSpace. Node
and AddressSpace are internal structures. AddressSpace is just an array of
pointers to Nodes and it can fit up to 4Mi pointers. Let's say there is a `nop`
(`4e71`) instruction at offset `0`. A single Node is created and placed at slots
`0` and `1` in AddressSpace. Next, let's say an `andiw #0x03ff,%d1` (`024103ff`)
instruction is encountered, which takes 4 bytes, so its Node is created and
occupies 4 slots in AddressSpace at offsets `2`, `3`, `4` and `5`.
And so on, and so on.
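
The slot bookkeeping can be pictured with a tiny Python model. This is only a
sketch under assumptions: the real AddressSpace and Node are internal
structures whose fields are not described here, so the `offset`, `size` and
`text` fields and the `place` method are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    offset: int   # first slot the Node occupies
    size: int     # number of slots (bytes) it occupies
    text: str     # disassembled text, for illustration only


class AddressSpace:
    def __init__(self, size):
        # One slot per byte of the binary; every slot points to a Node or None.
        self.slots = [None] * size

    def place(self, node):
        for i in range(node.offset, node.offset + node.size):
            assert self.slots[i] is None, "one Node per slot"
            self.slots[i] = node
        return node


space = AddressSpace(16)  # the real one holds 4Mi slots
nop = space.place(Node(0, 2, "nop"))
andi = space.place(Node(2, 4, "andiw #0x03ff,%d1"))
print([n.text if n else None for n in space.slots[:7]])
```

Slots `0` and `1` point to the `nop` Node, slots `2` to `5` point to the
`andiw` Node, and slot `6` is still empty.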

If an instruction references a memory location as code, i.e. it jumps or
branches to a location not yet handled by the disassembler, then a Node is
created at that location, but aligned down to 2 bytes. For example, an
instruction branching to offset `17` will create a Node of size 2 at offset
`16`, if it does not exist yet, and it won't be disassembled yet. It will be
marked as *not handled* in some way. When the disassembler finally reaches
address `16` by walking over instructions one by one, it will try to disassemble
it and go further... unless it turns out that an instruction at offset `14`
takes 4 bytes when disassembled and virtually *includes* the instruction located
at offset `16`. In that case the Node at location `16` is moved to location `14`
and disassembled there instead. As a result, the instruction covers both
locations, `14` and `16`, taking 4 slots in AddressSpace in total.
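
The offset `17` example can be traced with the following Python sketch. It is a
simplified model, not the actual implementation: the `handled` flag, the
function names and the rule that at most one unhandled Node is absorbed are
assumptions made to keep the example short.

```python
class Node:
    def __init__(self, offset, size):
        self.offset = offset
        self.size = size
        self.handled = False


def reference(slots, target):
    """Return the Node a branch to `target` refers to, creating it if needed."""
    aligned = target & ~1  # Nodes only live at even offsets in QuickPeek mode
    node = slots[aligned]
    if node is None:
        node = Node(aligned, 2)
        slots[aligned] = slots[aligned + 1] = node
    return node


def disassemble_at(slots, offset, size):
    """Record a `size`-byte instruction at `offset`, absorbing an unhandled Node."""
    covered = range(offset, offset + size)
    found = [slots[i] for i in covered if slots[i] is not None]
    # Simplification of this sketch: at most one unhandled Node is overlapped.
    node = found[0] if found else Node(offset, size)
    assert all(n is node and not n.handled for n in found)
    node.offset, node.size, node.handled = offset, size, True
    for i in covered:
        slots[i] = node
    return node


def label(node, target):
    """Express `target` relative to the Node's (even) offset."""
    delta = target - node.offset
    return f"L{node.offset:08x}" + (f"+{delta}" if delta else "")


slots = [None] * 32
ref = reference(slots, 17)            # branch to 17 creates a Node at 16
before = label(ref, 17)
insn = disassemble_at(slots, 14, 4)   # 4-byte instruction at 14 absorbs it
after = label(ref, 17)
print(before, "->", after)
```

Because the same Node object is reused when it moves from `16` to `14`, the
reference held by the branch stays valid and only the arithmetic in the emitted
label changes.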

One byte of the binary data being disassembled corresponds to one slot in
AddressSpace, as you have probably noticed already.

What if an instruction being disassembled is invalid? Well, it is left intact
then and emitted as a `.short 0xXXXX` line, with a label above it if it has been
referenced from somewhere.

Hence the rules of the QuickPeek mode are as follows:

- Disassembly always starts at offset `0`.

- Disassembly only takes place at addresses aligned to 2, i.e. even offsets.

- References to unaligned locations (odd offsets) are expressed relative to
  aligned locations only, with some arithmetic.

- No Nodes are placed at unaligned addresses (odd offsets).

- Referencing a forward (not yet handled) location creates an unhandled Node.

- Unhandled Nodes may be extended backwards and forwards when reached by the
  disassembly procedure, retaining all references to them. But unhandled Nodes
  are never shrunk or removed.
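
The first two rules, together with the `.short` fallback for invalid
instructions, amount to a linear sweep over even offsets. The Python sketch
below illustrates that loop with a toy decoder that only knows `nop` and
`andiw #imm,%dN`. Both `decode` and `quick_peek` are hypothetical names, and
label handling is left out.

```python
def decode(code, offset):
    """Toy decoder: return (size, text), or None for anything it does not know."""
    word = int.from_bytes(code[offset:offset + 2], "big")
    if word == 0x4E71:
        return 2, "nop"
    # andi.w #imm,Dn is 0000 0010 0100 0rrr followed by one immediate word.
    if (word & 0xFFF8) == 0x0240 and offset + 4 <= len(code):
        imm = int.from_bytes(code[offset + 2:offset + 4], "big")
        return 4, f"andiw #0x{imm:04x},%d{word & 7}"
    return None


def quick_peek(code):
    """Sweep from offset 0 over even offsets, one instruction after another."""
    lines = []
    offset = 0
    while offset + 2 <= len(code):
        decoded = decode(code, offset)
        if decoded is None:
            # Invalid instruction: keep the raw word so the binary round-trips.
            word = int.from_bytes(code[offset:offset + 2], "big")
            decoded = 2, f".short 0x{word:04x}"
        size, text = decoded
        lines.append(text)
        offset += size
    return lines


listing = quick_peek(bytes.fromhex("4e71024103ffffff"))
print("\n".join(listing))
```

Since every decoded size is even and the sweep starts at `0`, the loop never
visits an odd offset.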

## Proper mode

The Proper mode is only activated when a trace table has been provided
explicitly, or if the binary format supports and contains an entry point or more
information about code and data locations, like functions or just labels. In
contrast to QuickPeek mode, nothing is assumed unless specified explicitly.
Everything is considered possible, even unaligned instructions. Hence Nodes may
be placed even at odd offsets like `17` and occupy an odd number of slots in
AddressSpace, but not less than 1 slot.

But instead of disassembling everything, Proper mode takes guidance from the
trace table data. It starts at the lowest address specified as code (PC trace or
function) in the trace table and disassembles from this point.

If a Node is a PC trace Node, it is disassembled and everything after it is
skipped until another PC trace Node or function Node is encountered, unless the
`-fwalk` feature flag is given.

When the `-fwalk` feature flag is specified and a Node is a PC trace Node, it
behaves similarly to QuickPeek mode and tries to disassemble instructions one by
one until it reaches either a) a traced data Node, b) an unconditional control
flow instruction like `rts`, `jmp`, `bsr` and so on, or c) an invalid instruction.
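
The three stop conditions can be sketched as a loop. The Python below is an
illustration only: the toy decoder knows just `nop` and `rts`, the set of
traced data offsets is passed in directly instead of coming from a trace
table, and all names are hypothetical.

```python
NOP, RTS = 0x4E71, 0x4E75


def decode(code, offset):
    """Toy decoder: return (size, ends_flow), or None for an unknown opcode."""
    word = int.from_bytes(code[offset:offset + 2], "big")
    if word == NOP:
        return 2, False
    if word == RTS:
        return 2, True
    return None


def walk(code, start, data_offsets=frozenset()):
    """Return offsets of the instructions reached by walking from `start`."""
    reached = []
    offset = start
    while offset + 2 <= len(code):
        if offset in data_offsets:      # a) traced data Node
            break
        decoded = decode(code, offset)
        if decoded is None:             # c) invalid instruction
            break
        size, ends_flow = decoded
        reached.append(offset)
        if ends_flow:                   # b) unconditional control flow
            break
        offset += size
    return reached


code = bytes.fromhex("4e714e714e754e71")  # nop ; nop ; rts ; nop
print(walk(code, 0))
print(walk(code, 0, data_offsets={2}))
```

The first walk stops after the `rts` at offset `4` and never reaches the
trailing `nop`. The second one stops right before the data at offset `2`.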

If a Node is a function Node then... well, it does not make much sense to have
function Nodes at all, because a function spans multiple instructions, but a)
there can be only one instruction per Node and b) there can be only one Node per
slot. Functions should probably be in some other relationship, maybe even a
half-implicit one, like ELF symbols are now.

# Internal stuff

## AddressSpace, Nodes, Regions and Symbols

AddressSpace contains an array of `4 * 1024 * 1024` slots. Each slot is just a
pointer to a Node. Each Node may point to a single Region if the Node belongs to
one or more Symbols. A Region is a dynamically allocated array of pointers to
the Symbols that encompass this Region. A Symbol is one of the following:

- A function;

- A data span like string, blob, table of pointers or even more complex table.

Imagine it like this. Symbols may intersect, which means, for example, that a
function may contain a string at its end, and the string is nevertheless a part
of the function, so it is an intersection. A way to represent this intersection
is an intermediate entity called Region. Hence, multiple Nodes may share a
Region and multiple Symbols may share a Region. Basically, a Region that refers
to more than one Symbol represents a span of intersection of these Symbols.
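
One possible way to derive Regions from overlapping Symbols is to cut the
address range at every Symbol boundary and record which Symbols cover each
piece. The Python sketch below shows that idea on illustrative spans. It is an
assumption about how Regions could be computed, not a description of the actual
code, and `build_regions` is a made-up name.

```python
def build_regions(symbols):
    """Split overlapping Symbol spans into Regions.

    `symbols` is a list of (name, start, end) with half-open byte spans.
    Returns a list of (start, end, [names of Symbols covering the Region]).
    """
    # Every Symbol start or end is a potential Region boundary.
    bounds = sorted({b for _, start, end in symbols for b in (start, end)})
    regions = []
    for start, end in zip(bounds, bounds[1:]):
        members = [name for name, s, e in symbols if s <= start and end <= e]
        if members:  # gaps not covered by any Symbol get no Region
            regions.append((start, end, members))
    return regions


symbols = [
    ("fn", 0, 25),      # a function ...
    ("strz1", 8, 17),   # ... with a string table inside it
    ("str", 10, 12),
    ("strz2", 12, 17),
    ("blob", 25, 31),
    ("strz3", 31, 36),
]
regions = build_regions(symbols)
for start, end, members in regions:
    print(f"{start:2}..{end:2} {members}")
```

A Region whose list has more than one entry is exactly a span where Symbols
intersect. Nodes inside that span would all point to the same Region.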

A little visualization of how Regions correspond to Symbols. Each dot represents
a byte, and adjacent dots form a span:

```
Symbol(fn)   | .........................
Symbol(strz) |         .........
Symbol(str)  |           ..
Symbol(strz) |             .....
Symbol(blob) |                          ......
Symbol(strz) |                                .....
Region       | ........
Region       |         ..
Region       |           ..
Region       |             .....
Region       |                  ........
Region       |                          ......
Region       |                                .....
Node(pc)     | ..
Node(pc)     |   ....
Node(pc)     |       ..
Node(data)   |         ..
Node(data)   |           ..
Node(data)   |             .....
Node(pc)     |                  ..
Node(pc)     |                    ....
Node(pc)     |                        ..
Node(data)   |                          ......
Node(data)   |                                .....
```

This is how Nodes refer to Regions by holding pointers to them:

```
Symbol(fn)   | .........................
Symbol(strz) |         .........
Symbol(str)  |           ..
Symbol(strz) |             .....
Symbol(blob) |                          ......
Symbol(strz) |                                .....
Region       | ........
Region       | ^ ^   ^ ..
Region       | | |   | ^ ..
Region       | | |   | | ^ .....
Region       | | |   | | | ^    ........
Region       | | |   | | | |    ^ ^   ^ ......
Region       | | |   | | | |    | |   | ^     .....
Node(pc)     | ..|   | | | |    | |   | |     ^
Node(pc)     |   ....| | | |    | |   | |     |
Node(pc)     |       ..| | |    | |   | |     |
Node(data)   |         ..| |    | |   | |     |
Node(data)   |           ..|    | |   | |     |
Node(data)   |             .....| |   | |     |
Node(pc)     |                  ..|   | |     |
Node(pc)     |                    ....| |     |
Node(pc)     |                        ..|     |
Node(data)   |                          ......|
Node(data)   |                                .....
```

This is how Regions refer to Symbols by holding pointers to them:

```
Symbol(fn)   | .........................
Symbol(strz) | ^       .........^
Symbol(str)  | |       ^ ..     |
Symbol(strz) | |       | ^ .....|
Symbol(blob) | |       | | ^    |       ......
Symbol(strz) | |       | | |    |       ^     .....
Region       | ........| | |    |       |     ^
Region       |         ..| |    |       |     |
Region       |           ..|    |       |     |
Region       |             .....|       |     |
Region       |                  ........|     |
Region       |                          ......|
Region       |                                .....
Node(pc)     | ..
Node(pc)     |   ....
Node(pc)     |       ..
Node(data)   |         ..
Node(data)   |           ..
Node(data)   |             .....
Node(pc)     |                  ..
Node(pc)     |                    ....
Node(pc)     |                        ..
Node(data)   |                          ......
Node(data)   |                                .....
```