How I Built This Thing (Part 3) - Tools
Tools
Z88DK
Z88DK is the best toolchain available for ZX Spectrum development.
It provides a C compiler, linker, assembler, and the invaluable ticks cycle simulator — plus a solid collection of libraries. Writing World of Spells would not have been possible without it. Huge respect to the entire team.
It also includes zx0, which I use for compression. All level maps are stored compressed and decompressed only when needed.
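The decompress-on-demand idea can be sketched in portable C so it can be tried on a desktop compiler. Everything named here (`level_cache`, `level_cache_get`, `decompress_fn`, `LEVEL_SIZE`, the single-buffer cache) is an assumption for illustration and not the game's actual code; on the Spectrum the callback would wrap the zx0 decompressor, and `fake_unpack` is only a desktop stand-in.

```c
#include <stddef.h>
#include <stdint.h>

#define LEVEL_SIZE 1024u   /* hypothetical size of one unpacked map */
#define NO_LEVEL   0xFFu   /* marker: nothing has been unpacked yet */

/* Hypothetical decompressor signature: unpack src into dst. */
typedef void (*decompress_fn)(const uint8_t *src, uint8_t *dst);

typedef struct {
    uint8_t       buffer[LEVEL_SIZE]; /* the only unpacked map kept in RAM */
    uint8_t       current;            /* index of the map held in buffer */
    decompress_fn decompress;         /* e.g. a wrapper around zx0 */
} level_cache;

void level_cache_init(level_cache *cache, decompress_fn fn)
{
    cache->current = NO_LEVEL;
    cache->decompress = fn;
}

/* Return the unpacked map for `level`, decompressing only on a cache miss. */
const uint8_t *level_cache_get(level_cache *cache,
                               const uint8_t *const *packed_levels,
                               uint8_t level)
{
    if (cache->current != level) {
        cache->decompress(packed_levels[level], cache->buffer);
        cache->current = level;
    }
    return cache->buffer;
}

/* Desktop stand-in for a real decompressor: fills dst with the first
   source byte and counts how often it runs. */
unsigned unpack_calls = 0;

void fake_unpack(const uint8_t *src, uint8_t *dst)
{
    size_t i;
    for (i = 0; i < LEVEL_SIZE; ++i) {
        dst[i] = src[0];
    }
    ++unpack_calls;
}
```

Asking for the same level twice in a row costs one decompression, which is the point: the packed data stays small in memory and the unpacking cost is paid only on a level change.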
For sound, there are few realistic choices on the Spectrum. The pt3 library handles AY music reliably, and bit_fx is very handy for buzzer-based sound effects.
Other Tools (Outside Z88DK)
- A very useful Z80 instruction reference: https://clrhome.org/table/#%20
- The built-in debugger in the Fuse emulator, which I used heavily. It works, and that’s saying a lot.
The C Compiler
I’ll say this upfront: the Z88DK C compiler is complete, functional, and incredibly useful during algorithm development. For prototyping and experimentation, it’s invaluable.
Is it fast?
Reasonably fast — with caveats.
The Z80 instruction set is fundamentally hostile to stack-based languages when the goal is fast code. Efficient code on this CPU requires manual register allocation. That’s not something a compiler can realistically solve here, not without AI-level global reasoning or a human staring at disassembly for hours.
Function Arguments
Functions with one or two arguments are fine. Push/pop into registers works well and remains interrupt-friendly.
Once arguments go deeper on the stack, things get expensive. You either start doing arithmetic on SP or accept slow memory access using index registers IX/IY.
You can use __FASTCALL__ to pass an argument in registers instead, but it covers only a single argument, so it helps in very specific cases.
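One way to stretch a single register argument is to pack two 8-bit values into one 16-bit word. The sketch below is illustrative only: `cell_index`, `PACK_XY` and the 32-column map are hypothetical, and it assumes a Z88DK version that defines `__Z88DK` and accepts the `__z88dk_fastcall` spelling of the attribute. On any other compiler the macro expands to nothing, so the code still builds.

```c
#include <stdint.h>

/* Assumption: Z88DK defines __Z88DK and accepts __z88dk_fastcall after
   the parameter list; elsewhere FASTCALL expands to nothing. */
#ifdef __Z88DK
#define FASTCALL __z88dk_fastcall
#else
#define FASTCALL
#endif

/* Pack two 8-bit coordinates into one 16-bit argument: x low, y high. */
#define PACK_XY(x, y) ((uint16_t)(((uint16_t)(y) << 8) | (uint8_t)(x)))

/* Hypothetical helper: index of cell (x, y) in a 32-column map.
   With fastcall the packed value arrives in a register pair, so the
   function does not have to read its argument back from the stack. */
uint16_t cell_index(uint16_t packed_xy) FASTCALL
{
    uint8_t x = (uint8_t)(packed_xy & 0xFFu);
    uint8_t y = (uint8_t)(packed_xy >> 8);
    return (uint16_t)(((uint16_t)y << 5) + x);   /* y * 32 + x, shifts only */
}
```

A call looks like `cell_index(PACK_XY(3, 2))`. The packing is free when both coordinates are constants, and cheap when they already sit in the two halves of a register pair.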
Generated Code
Hand-optimized assembly can be up to 10× faster than compiler output — at a significant cost in development time.
At the assembly level, many tricks become available:
- Self-modifying code
  LD HL, nn is faster than LD HL, (nn)
  It also reduces register pressure in many cases.
  This is used extensively in the raycaster and renderer.
- Manual register allocation
- Bit operations instead of arithmetic
  Example: SET 5, L is faster than LD A, L → ADD A, 32 → LD L, A
  This only works if data structures are carefully aligned.
- Second register set (EXX, EX AF,AF')
  Extremely useful when registers run out.
- Using IX / IY where appropriate
  Requires a custom interrupt handler.
- Structure alignment to nn00 addresses
- Moving hot code above 32768
  Below that, the ULA steals CPU cycles.
- Moving the stack above 32768
  This alone gave roughly a 1% speedup.
- Manual loop unrolling
  Especially effective in rendering.
- Aggressive SP manipulation
  Temporarily moving SP and using PUSH/POP as fast memory access.
  Requires disabling interrupts.
  Rarely useful, but critical for the per-pixel interpolating renderer.
- Loop tail lifting
  Moving the last part of a loop to the beginning often removes jumps.
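The alignment-dependent tricks in the list above can be modelled in plain C. This is a sketch of the address arithmetic only, not of the game's data layout: the 64-byte record alignment, the 32-byte offset and all function names are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical layout: records start on 64-byte boundaries and the
   second half of each record begins 32 bytes in. */
#define RECORD_ALIGN 64u
#define HALF_OFFSET  32u   /* value of bit 5 */

/* Arithmetic version: add the offset to the address. */
uint16_t second_half_add(uint16_t record_addr)
{
    return (uint16_t)(record_addr + HALF_OFFSET);
}

/* Bit version: set bit 5, which is what SET 5,L does to the low byte.
   Only valid while bit 5 of the address is known to be clear, which
   the 64-byte alignment guarantees. */
uint16_t second_half_set(uint16_t record_addr)
{
    return (uint16_t)(record_addr | HALF_OFFSET);
}

/* Table aligned to an nn00 address: the entry address is formed by
   replacing the low byte with the index, so no 16-bit addition is
   needed (on the Z80 this is a single LD L,r). */
uint16_t table_entry(uint16_t table_base, uint8_t index)
{
    return (uint16_t)((table_base & 0xFF00u) | index);
}
```

For every 64-byte-aligned address the two `second_half_*` functions return the same value. With a misaligned address they diverge, which is why the bit version depends on the layout being controlled.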
The reality is simple:
Having a C compiler is fantastic for development — but its output is not fast enough for extreme performance. That’s not Z88DK’s fault.
The Z80 was simply not designed for high-level languages the way an architecture such as ARM is.
Still, Z88DK’s assembler is a joy to use. Macro support alone saves countless hours.
What Was Not Used
- The C standard library, including stdio and printf
  Unsuitable for real-time games.
  I wrote a minimal puts instead for World of Spells.
- Runtime math
  As tempting as multiplication is, it’s avoided almost entirely.
  - 8×8 multiplication: ~200 cycles
  - Division: similar cost
  They are simply too slow.
Am I using them at all?
Yes — only when unavoidable.
atan2() uses a division + LUT combination. There is no faster way to compute ghost screen positions accurately. Classical Wolfenstein-style rotation matrices are even slower.
Other than FPS calculation, there is no multiplication in the runtime engine.
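A division plus lookup table atan2 can be sketched as follows. This is a generic textbook-style version written for illustration, not the routine used in World of Spells. The 33-entry table, the 256-units-per-circle angle format and the name `atan2_lut` are all assumptions. Table values are atan(i/32) rounded to the nearest unit.

```c
#include <stdint.h>

/* atan(i / 32) for i = 0..32, in units of 1/256 of a full circle
   (so 45 degrees = 32 units). Precomputed, rounded values. */
static const uint8_t atan_lut[33] = {
     0,  1,  3,  4,  5,  6,  8,  9, 10, 11, 12,
    13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
    25, 25, 26, 27, 28, 29, 29, 30, 31, 31, 32
};

/* Angle of the vector (dx, dy) in 0..255, counter-clockwise from +x.
   Costs one division; the scaling by 32 is a shift. */
uint8_t atan2_lut(int16_t dy, int16_t dx)
{
    uint16_t ax = (uint16_t)(dx < 0 ? -(int32_t)dx : (int32_t)dx);
    uint16_t ay = (uint16_t)(dy < 0 ? -(int32_t)dy : (int32_t)dy);
    uint8_t angle;

    if (ax == 0 && ay == 0) {
        return 0;                 /* direction undefined; pick 0 */
    }

    /* First octant pair: divide the smaller magnitude by the larger
       so the ratio always falls in 0..32. */
    if (ay <= ax) {
        angle = atan_lut[((uint32_t)ay << 5) / ax];                   /* 0..32  */
    } else {
        angle = (uint8_t)(64u - atan_lut[((uint32_t)ax << 5) / ay]);  /* 32..64 */
    }

    /* Mirror into the correct quadrant; uint8_t wraps modulo 256. */
    if (dx < 0) {
        angle = (uint8_t)(128u - angle);
    }
    if (dy < 0) {
        angle = (uint8_t)(0u - angle);
    }
    return angle;
}
```

The single division is the expensive step; everything else is a table read plus a few subtractions.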
Boring (But Necessary) Software Practices
Some modern practices still apply. Object-oriented programming does not.
However:
- Unit tests
- Component tests
- Performance tests
…are absolutely essential.
Not for every function — but the raycaster, renderer, and object system are all covered.
These tests run inside the emulator and produce performance statistics:
- Counting 50 Hz interrupts during execution
- Measuring cycle counts with the ticks emulator
Ticks does not emulate ULA contention, so results must be taken with caution — but it is still invaluable.
During late-stage optimization, these tests were critical. Many changes were risky and impossible to evaluate without accurate cycle measurements.
Without ticks, several OL5 → OL6 optimizations simply wouldn’t have happened.
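The interrupt-counting approach can be sketched in portable C. The names `frame_ticks`, `measure_ticks` and `fake_workload` are hypothetical, and the sketch assumes an interrupt handler (not shown) that increments the counter once per 50 Hz interrupt. On a desktop, `fake_workload` stands in for that handler by advancing the counter itself.

```c
#include <stdint.h>

/* Assumption: an interrupt handler (not shown) increments this counter
   once per 50 Hz frame interrupt. */
volatile uint16_t frame_ticks = 0;

typedef void (*workload_fn)(void);

/* Run `work` `repeats` times and return how many 50 Hz ticks elapsed.
   Unsigned subtraction keeps the result correct across counter wrap. */
uint16_t measure_ticks(workload_fn work, uint16_t repeats)
{
    uint16_t start = frame_ticks;
    while (repeats--) {
        work();
    }
    return (uint16_t)(frame_ticks - start);
}

/* Convert ticks to milliseconds: one tick is 20 ms at 50 Hz. */
uint32_t ticks_to_ms(uint16_t ticks)
{
    return (uint32_t)ticks * 20u;
}

/* Desktop stand-in for a workload: pretends each call takes 3 frames. */
void fake_workload(void)
{
    frame_ticks = (uint16_t)(frame_ticks + 3u);
}
```

Resolution is coarse (20 ms per tick), so a test like this has to repeat the workload many times. Cycle-exact numbers still have to come from ticks.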
Example performance test result screenshot from some older version:
Credits
This project relies on:
- zcc compiler suite (compiler, linker, assembler)
- ticks emulator
- pt3 AY music library
- bit_fx buzzer sound library
- zx0 compression tools
- 4-bit wide font
Huge thanks to the Z88DK team for making all of this possible.
Get World of Spells
World of Spells
FPS on ZX Spectrum 48k
| Status | Released |
| Author | jtpl |
| Genre | Action, Adventure |
| Tags | 3D, Doom, Exploration, Fantasy, Fast-Paced, First-Person, No AI, Sprites, ZX Spectrum |
| Languages | English |
Comments
"Efficient code on this CPU requires manual register allocation" ..
"Hand-optimized assembly can be up to 10× faster than compiler output — at a significant cost in development time."
I see you have expert knowledge of Z80. Question: would it be possible today to speed up, for example, R-type or Chase HQ by a significant percentage?
Their authors shall have better judgement than I do here. ;)
By looking at Chase HQ under debugger I don't think it would be easy. It seems to be highly optimized in most important parts. Minor improvements are for sure possible (they always are), but significant - I doubt it.
thanks .. And the original authors, I assume, are all over 65 years old and senility is starting to set in, heh.