How I Built This Thing (Part 3) - Tools


Tools

Z88DK

Z88DK is the best toolchain available for ZX Spectrum development.

It provides a C compiler, linker, assembler, and the invaluable ticks cycle simulator — plus a solid collection of libraries. Writing World of Spells would not have been possible without it. Huge respect to the entire team.

It also includes zx0, which I use for compression. All level maps are stored compressed and decompressed only when needed.

For sound, there are few realistic choices on the Spectrum. The pt3 library handles AY music reliably, and bit_fx is very handy for buzzer-based sound effects.

Other Tools (Outside Z88DK)

  • A very useful Z80 instruction reference:
    https://clrhome.org/table/#%20

  • The built-in debugger in the Fuse emulator, which I used heavily. It works, and that’s saying a lot.

The C Compiler

I’ll say this upfront: the Z88DK C compiler is complete, functional, and incredibly useful during algorithm development. For prototyping and experimentation, it’s invaluable.

Is it fast?

Reasonably fast — with caveats.

The Z80 instruction set is fundamentally hostile to fast code generated from stack-based languages such as C. Efficient code on this CPU requires manual register allocation, and that is not something a compiler can realistically solve here without AI-level global reasoning or a human staring at disassembly for hours.

Function Arguments

Functions with one or two arguments are fine. Push/pop into registers works well and remains interrupt-friendly.

Once arguments go deeper on the stack, things get expensive. You either start doing arithmetic on SP or accept slow memory access using index registers IX/IY.

You can use __FASTCALL__ to force register passing, but it only applies to functions that take a single argument, so it helps only in very specific cases.
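One way to work within the single-argument limit is to pack two 8-bit values into one 16-bit argument. The sketch below is only an illustration, not code from World of Spells: map_cell, PACK_YX, level_map and the 32×32 map size are hypothetical names, and it assumes the newer __z88dk_fastcall spelling and that the __SCCZ80 / __SDCC macros identify the Z88DK compilers. On any other compiler the attribute is compiled out, so the code still builds as plain C.

```c
#include <stdint.h>

/* Assumption: __z88dk_fastcall is only understood by the Z88DK compilers,
   so it is replaced by nothing everywhere else. */
#if defined(__SCCZ80) || defined(__SDCC)
#define FASTCALL __z88dk_fastcall
#else
#define FASTCALL
#endif

/* Hypothetical 32x32 level map, one byte per cell. */
uint8_t level_map[32 * 32];

/* Pack two 8-bit coordinates into a single 16-bit argument (high byte = y,
   low byte = x), so the whole argument can travel in one register pair. */
#define PACK_YX(y, x) ((uint16_t)(((uint16_t)(y) << 8) | (uint8_t)(x)))

/* Single-argument function: eligible for fastcall register passing. */
uint8_t map_cell(uint16_t yx) FASTCALL
{
    uint8_t x = (uint8_t)(yx & 0xFF);
    uint8_t y = (uint8_t)(yx >> 8);

    /* Row stride of 32 is a shift, so no multiplication is needed. */
    return level_map[((uint16_t)(y & 31) << 5) | (x & 31)];
}
```

A call then looks like map_cell(PACK_YX(3, 5)). The packing costs nothing when both coordinates are constants, and very little when they already live in two 8-bit registers.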

Generated Code

Hand-optimized assembly can be up to 10× faster than compiler output — at a significant cost in development time.

At the assembly level, many tricks become available:

  • Self-modifying code
    LD HL, nn (10 T-states) is faster than LD HL, (nn) (16 T-states)
    It also reduces register pressure in many cases.
    This is used extensively in the raycaster and renderer.

  • Manual register allocation

  • Bit operations instead of arithmetic
    Example:
    SET 5, L (8 T-states) is faster than
    LD A, L → ADD A, 32 → LD L, A (15 T-states)
    This only works if bit 5 of L is known to be clear, so data structures must be carefully aligned.

  • Second register set (EXX, EX AF,AF')
    Extremely useful when registers run out.

  • Using IX / IY where appropriate
    Requires a custom interrupt handler.

  • Structure alignment to nn00 addresses

  • Moving hot code above 32768
    Below that, the ULA steals CPU cycles.

  • Moving the stack above 32768
    This alone gave roughly a 1% speedup.

  • Manual loop unrolling
    Especially effective in rendering.

  • Aggressive SP manipulation
    Temporarily moving SP and using PUSH/POP as fast memory access.
    Requires disabling interrupts.
    Rarely useful — but critical for the per-pixel interpolating renderer.

  • Loop tail lifting
    Moving the last part of a loop to the beginning often removes jumps.
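The alignment requirement behind the bit-operation and nn00 tricks can be checked in plain C. The following is a minimal sketch under assumed layouts (64-byte records whose second half starts at offset 32, and 256-byte tables that start on a page boundary). All names are hypothetical, and the C code only demonstrates the arithmetic; it does not claim the compiler will emit SET 5, L for it.

```c
#include <stdint.h>

/* Hypothetical layout: 64-byte records, second half at offset 32 (bit 5). */
#define SECOND_HALF 0x20u

/* Arithmetic form: corresponds to LD A,L / ADD A,32 / LD L,A. */
uint16_t second_half_add(uint16_t record_addr)
{
    return (uint16_t)(record_addr + SECOND_HALF);
}

/* Bit form: corresponds to SET 5,L. It equals the addition only when
   bit 5 of the address is clear, e.g. when records start on 64-byte
   boundaries. */
uint16_t second_half_set(uint16_t record_addr)
{
    return (uint16_t)(record_addr | SECOND_HALF);
}

/* Table aligned to an nn00 address: the high byte selects the table and
   the low byte is the index, so no 16-bit addition is needed. */
uint16_t table_entry_addr(uint8_t page, uint8_t index)
{
    return (uint16_t)(((uint16_t)page << 8) | index);
}
```

For a record at 0xC040 both forms give 0xC060. For a misaligned record at 0xC020 the bit form returns 0xC020 unchanged while the addition gives 0xC040, which is exactly the failure the alignment rule prevents.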

The reality is simple:

Having a C compiler is fantastic for development, but its output is not fast enough for extreme performance. That’s not Z88DK’s fault.

The Z80 is simply not designed for high-level languages in the way that, for example, the ARM architecture is.

Still, Z88DK’s assembler is a joy to use. Macro support alone saves countless hours.

What Was Not Used

  • The C standard library, including stdio and printf
    Unsuitable for real-time games.
    I wrote a minimal puts instead for World of Spells.

  • Runtime math
    As tempting as multiplication is — it’s avoided almost entirely.

    • 8×8 multiplication: ~200 cycles

    • Division: similar cost

    They are simply too slow.

Am I using them at all?
Yes — only when unavoidable.

atan2() uses a division + LUT combination. There is no faster way to compute ghost screen positions accurately. Classical Wolfenstein-style rotation matrices are even slower.

Other than FPS calculation, there is no multiplication in the runtime engine.
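To make the "division + LUT" idea concrete, here is a minimal sketch of an integer atan2 that performs one division per call and no multiplication (the scaling by 16 is a shift). It is not the routine used in World of Spells: the 17-entry table, the 256-units-per-turn angle format and the function names are all assumptions chosen for illustration, and inputs are assumed to lie in -32767..32767.

```c
#include <stdint.h>

/* atan(r/16) for r = 0..16, in units of 1/256 of a full turn
   (so 45 degrees = 32). Values are rounded to the nearest unit. */
static const uint8_t atan_lut[17] = {
    0, 3, 5, 8, 10, 12, 15, 17, 19, 21, 23, 25, 26, 28, 29, 31, 32
};

/* Table index = round(16 * small / large), requires small <= large, large > 0.
   This is the single division per atan2 call. */
static uint8_t ratio_index(uint16_t small, uint16_t large)
{
    return (uint8_t)((((uint32_t)small << 4) + (large >> 1)) / large);
}

/* Angle of the vector (dx, dy): 0 = +x axis, 64 = +y axis, 128 = -x axis. */
uint8_t atan2_lut(int16_t dy, int16_t dx)
{
    uint16_t ax = (uint16_t)(dx < 0 ? -dx : dx);
    uint16_t ay = (uint16_t)(dy < 0 ? -dy : dy);
    uint8_t a;

    if (ax == 0 && ay == 0)
        return 0;                       /* undefined direction, pick 0 */

    /* Fold into the first octant so the ratio stays within 0..1. */
    if (ay <= ax)
        a = atan_lut[ratio_index(ay, ax)];
    else
        a = (uint8_t)(64 - atan_lut[ratio_index(ax, ay)]);

    /* Unfold to the correct quadrant using the original signs. */
    if (dx < 0)
        a = (uint8_t)(128 - a);
    if (dy < 0)
        a = (uint8_t)(256 - a);
    return a;
}
```

With this table the worst-case error is about one angle unit (roughly 1.4 degrees). A larger table or interpolation between entries would trade memory or cycles for accuracy.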

Boring (But Necessary) Software Practices

Some modern practices still apply. Object-oriented programming does not.

However:

  • Unit tests

  • Component tests

  • Performance tests

…are absolutely essential.

Not for every function — but the raycaster, renderer, and object system are all covered.

These tests run inside the emulator and produce performance statistics:

  • Counting 50 Hz interrupts during execution

  • Measuring cycle counts with the ticks emulator

Ticks does not emulate ULA contention, so results must be taken with caution — but it is still invaluable.
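The interrupt-counting measurement can be sketched roughly as follows. This is an assumed harness, not the project's actual test code: frame_ticks, on_frame_interrupt and bench_frames are hypothetical names, and the conversion uses the 48K Spectrum frame length of 69888 T-states. On real hardware the counter would be incremented from the 50 Hz interrupt handler; fake_two_frame_routine is a host-only stand-in that pretends a routine took two frames.

```c
#include <stdint.h>

/* One 48K Spectrum frame: 312 lines x 224 T-states. */
#define TSTATES_PER_FRAME_48K 69888UL

/* Assumed to be incremented once per 50 Hz interrupt. */
volatile uint16_t frame_ticks;

/* Call this from the interrupt handler (hypothetical hook). */
void on_frame_interrupt(void)
{
    frame_ticks++;
}

/* Run the routine `runs` times and return how many frames elapsed.
   Unsigned subtraction keeps the result correct across counter wrap. */
uint16_t bench_frames(void (*routine)(void), uint16_t runs)
{
    uint16_t start = frame_ticks;

    while (runs--)
        routine();
    return (uint16_t)(frame_ticks - start);
}

/* Convert a frame count to an approximate T-state count. */
uint32_t frames_to_tstates(uint16_t frames)
{
    return (uint32_t)((uint32_t)frames * TSTATES_PER_FRAME_48K);
}

/* Host-only stand-in for a routine that takes exactly two frames. */
void fake_two_frame_routine(void)
{
    on_frame_interrupt();
    on_frame_interrupt();
}
```

Because the resolution is one whole frame, a routine has to be repeated enough times for the total to span many frames before the average means anything. Cycle-exact numbers still have to come from ticks.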

During late-stage optimization, these tests were critical. Many changes were risky and impossible to evaluate without accurate cycle measurements.

Without ticks, several OL5 → OL6 optimizations simply wouldn’t have happened.

[Screenshot: example performance test results from an older version]


Credits

This project relies on:

  • zcc compiler suite (compiler, linker, assembler)

  • ticks emulator

  • pt3 AY music library

  • bit_fx buzzer sound library

  • zx0 compression tools

  • 4-bit wide font

Huge thanks to the Z88DK team for making all of this possible.





Comments

"Efficient code on this CPU requires manual register allocation" .. 

"Hand-optimized assembly can be up to 10× faster than compiler output — at a significant cost in development time."

I see you have expert knowledge of Z80. Question: would it be possible today to speed up, for example, R-type or Chase HQ by a significant percentage?

Their authors shall have better judgement than I do here. ;)

By looking at Chase HQ under debugger I don't think it would be easy. It seems to be highly optimized in most important parts. Minor improvements are for sure possible (they always are), but significant - I doubt it.

thanks .. And the original authors, I assume, are all over 65 years old and senility is starting to set in, heh.