Rapidly rendering fractals on stupidly unsuitable machines

So a week or so ago I wrote a toy Mandelbrot generator for the BBC Micro, capable of using 2.6-bit fixed point arithmetic to draw a pretty terrible vaguely-Mandelbroid image in twelve seconds. That program is now obsolete: the very poorly named Bogomandel is capable of drawing pretty good actual Mandelbrot and Julia images, at higher precision, faster. Here it is.

Update! I managed to squeeze an extra two bits of precision out of the number representation. There’s a bit of a performance penalty, but it’s possible to zoom all the way in to 0.25 now!

Bogomandel running in real time

(This is a recording of it running in real time on an emulated BBC Master: a 2MHz 65C02-based machine with 128kB of RAM.)

Bogomandel running in Jsbeeb

Run it for yourself in the comfort and privacy of your own web browser, using the Jsbeeb BBC Micro emulator!

Bogomandel's github page

See the code on Github.

Introduction

So the story behind the program is that as I was feeling pretty smug about myself for having rendered fractals at an unprecedentedly fast speed on the BBC Micro, I was chatting to someone about it on the vcfed forums. They, going by the handle reenigne (a.k.a. Andrew Jenner) — one of the minds behind the seminal 8088mph demo — came up with a cunning trick for using tables of squares to speed up Mandelbrot calculations; but that was on the 8088, which has vastly more memory and a generally more 16-bit friendly instruction set than the 6502. So I’d written off the 6502 for high-speed 16-bit fractals.

What happened was that reenigne came up with an idea for using thirteen bit integers, with 4.9 bit fixed point… and then handed me a 65C02 calculation kernel which was pretty much ready to drop into my program.

The use of thirteen bit integers was game changing, and I’m going to explain why, because it’s really interesting.

Mandelbrots

A quick recap: Mandelbrots are graphs of the complex plane, with real numbers along the $x$ axis and imaginary numbers along the $y$ axis. To determine the colour of any given point $c$ , then the simplified version is that I start with $z = c = x + yi$ and count the number of iterations of $z' = z^2 + c$ until $|z| > 2$ . (With an upper limit, as points inside the set will never reach that point.)

I, however, am but a simple software engineer and don’t understand complex numbers, so I prefer to think of $z$ as $z_r + z_i i$ . Expanding leads (and remembering that $i^2 = -1$ ) leads to this:

$\begin{align} z' = z'_r + z'_i i &= z^2 + c \\ &= (z_r + z_i i)^2 + c_r + c_i i \\ & = z_r^2 + 2 z_r z_i i + z_i^2 i^2 + c_r + c_i i \\ & = z_r^2 + 2 z_r z_i i - z_i^2 + c_r + c_i i \\ \end{align}$

I can then extract real and imaginary parts, leading to the much more software-friendly:

$\begin{align} z'_r &= z_r^2 - z_i^2 + c_r \\ z'_i &= 2 z_r z_i + c_i \end{align}$

Drawing pictures of the Mandelbrot set (and the Julia set, I’ll get onto that later) is all about trying to compute this as quickly as is humanly possible.

Fixed point and multiplication

My original program achieved this by having a gigantic lookup table of all possible numbers multiplied by all possible other numbers. This was achievable because my numbers were only eight bits wide, and the table was therefore 64kB, which would fit in the BBC Master’s four paged RAM banks. I used 2.6 fixed point arithmetic, which essentially means that I scale every number up by $2^6$ ; $1.0$ is represented by the integer 64.

Where this fell down was that my representation was very limited as to what numbers it can represent. The smallest number possible was the integer 1, which represented $\frac{1}{64} \approx 0.016$ ; the largest number was the integer 127 (the top bit is used for the sign), which represented $1 \frac{63}{64} \approx 1.98$ .

Because my entire numberspace only contained 256 different numbers, I could only render images which were 256 pixels wide. Except it was worse; the computation of $z'_r = z_r^2 - z_i^2 + c_r$ involves a large intermediate value which may be greater than $1.98$ , even though the end result might not be. So I was simply unable to render pixels where $z_r$ or $z_i$ were large enough where squaring them would cause an overflow. The resulting useful image is only 180x180.

The old program

But was still worse: I needed to test for $|z| > 2$ , but my numbers didn’t go as high as $2$ … the program actually used a cunning trick which I won’t go into (but which I stole from a blog post on Nothing Untoward, so you can read about it there), which worked for the main lake but was responsible for the strange colours and artifacts surrounding it in the above picture.

Tables of squares

So I needed more precision, to let me represent both bigger and smaller numbers. But moving up from eight bits would have made my multiplication table much bigger; a multiplication table for 16 bit numbers will consume 4GB of RAM, slightly out of reach for a BBC Master

However (and this was reenigne’s insight, not mine), maths can come to my aid.

Consider this:

$(x+y)^2 = x^2 + 2xy + y^2$

This can be rearranged to:

$2xy = x^2 + y^2 - (x-y)^2$

My original calculation has a $2xy$ in it, so let me try substituting the expression above in:

$\begin{align} z'_r &= z_r^2 - z_i^2 + c_r \\ z'_i &= 2 z_r z_i + c_i \\ &= z_r^2 + z_i^2 - (z_r - z_i)^2 + c_i \\ \end{align}$

This looks like a more complicated expression, but in fact all multiplications which aren’t squares have been eliminated. This is useful, because squares are significantly more tractable for using lookup tables: a table of squares for 8-bit values contains only $2^8 = 256$ elements, one for each possible value. The only other arithmetic operations are additions and subtractions, which are cheap, even on a 6502.

However, a table of squares for 16-bit values will contain $2^{16} = 65536$ elements, each of two bytes, so it’s 128kB. That’s still too big to fit in memory. But a table for a 13-bit integer will contain $2^{13} = 8192$ elements, which is 16kB, which will fit nicely in one of the BBC Master’s sideways RAM banks. Suddenly, table lookups for squares are feasible.

Faster speed

The 6502’s really bad at handling 16-bit values — I’ve written some posts on this. To look up one of these squares, I would have to do this (in pseudocode):

to do: result = n^2
  offset := n * bytes per entry
  address := address of table + offset
  result := dereference address

Because the 6502 can only directly manipulate 8-bit values, each of these operations has to be broken down into multiple instructions. Something like this:

to do: result = n^2
  lsr n+0
  rol n+1

  clc
  lda #lo(address of table)
  adc n+0
  sta address+0
  lda #hi(address of table)
  adc n+1
  sta address+1
  ...etc...

It gets really unwieldy, and quite slow.

This is where reenigne’s second insight comes in: I was working on 13-bit values in a 16-bit word. I didn’t need to put those values in the bottom of the word; I could store them in bits 13 to 1, leaving the bottom bit unused, so I could always set it to 0:

$\begin{matrix} \text{bit of word} & 15 & 14 & 13 & 12 & 11 & 10 & 9 & 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \\ & & & n_{12} & n_{11} & n_{10} & n_9 & n_8 & n_7 & n_6 & n_5 & n_4 & n_3 & n_2 & n_1 & n_0 & 0 \end{matrix}$

This meant that the integer value of each one my fixed point numbers was even… and as the size of each element in my lookup table is two bytes, this meant that I no longer needed to scale the numbers before adding them to the address of the table. Each number was its own offset into the table!

I could also make use of the top two bits in the word, which are currently unused: it just so happened that the BBC Master’s sideways RAM bank, where I wanted to store my 16kB lookup table, is at 0x8000… which is 0b10000000_00000000. If I set the two unused bits at the top to 1 and 0 I get this:

$\begin{matrix} \text{bit of word} & 15 & 14 & 13 & 12 & 11 & 10 & 9 & 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \\ & 1 & 0 & n_{12} & n_{11} & n_{10} & n_9 & n_8 & n_7 & n_6 & n_5 & n_4 & n_3 & n_2 & n_1 & n_0 & 0 \end{matrix}$

Now, not only was each number prescaled, but they had the address of the lookup table baked into them. Every valid number was a pointer to its own square!

The long machine code routine above became a very simple (and very fast):

to do: result = n^2
    lda (n)
    sta result+0
    ldy #1
    lda (), y
    sta result+1

This works because the operations I’m doing, addition, subtraction and taking the square (via the lookup table), either don’t care about any of these bits except the bottom 0… and the bottom 0 is harmless because it’s always 0.

I did need to take care to preserve the extra bits when doing computations. The low bit took care of itself, but the high bit needed some work. Consider computing 0x8400 - 0x8200 (a.k.a. $2-1$ ). The result is obviously 0x0200 (a.k.a. $1$ ). The significant high bit gets lost, so I can’t take its square without fixing the number up. Luckily, that’s really cheap to do:

to do: result = x - y
    sec
    lda x+0      ; subtract low byte
    sbc y+0
    sta result+0
    lda x+1
    sbc y+1
    and #&3F     ; perform fixup
    ora #&80
    sta result+1

It’s only two extra eight-bit instructions — and I only needed to do this on numbers which I wanted to square. There are some intermediate computations which didn’t need fixing up, so I didn’t (or rather reenigne didn’t).

Bailing out

There was final tiny piece of work to do. I needed to iterate until $|z| > 2$ . In components, that’s:

$\begin{align} |z| &= |z_r + z_i i| \\ &= \sqrt{z_r^2 + z_i^2} \end{align}$

Performing the squares is easy, but square roots are hard. Luckily I didn’t care about the actual value of $|z|$ , only whether it was bigger than $2$ . By squaring both sides of the comparison I could test $|z|^2 > 4$ instead; $|z|^2$ is simply $z_r^2 + z_i^2$ , and hey, I was already calculating this as an intermediate value! So getting proper iteration counts are nigh trivial.

But does it work?

Amazingly well. This (combined with some other optimisations that actually were mine) let it render most images in about thirty seconds; particularly complex scenes are still under a minute.

Of course, 13-bit numbers don’t really have the precision to zoom very far into the Mandelbrot set, and Bogomandel only supports three zoom levels (1, 2 and 4). Julia sets look a bit weird close up, so I think I’m running into precision problems there too, but I’m not quite sure how.

The BBC Master actually has four 16kB RAM banks, all mapped at 0x8000. Currently Bogomandel uses one. It’s tempting to try to use all of them and squeeze in another couple of bits of precision, allowing deeper zooms; but not only would this require switching in the right bank for every table lookup, but I wouldn’t be able to use reenigne’s baking-in-the-table-address trick, so I’d have to manually compute the address of the element in the table. This is all adding up to quite a lot of code, and every cycle in the kernel matters. So I’m not sure it’s worth it.

Update: I managed to squeeze in another couple of bits of precision. See the writeup below.

But then, that’s what I said last time. If anyone can prove me wrong (update: wronger), I’d be delighted… get in touch!

The result

For completeness, here’s the entire kernel which Bogomandel uses (although I’ve undone some of the self-modifying code for clarity). (The rest of it is here.) (Update: the code here is the 13-bit version, which is now obsolete. See the link for the up-to-date code.)

There are a lot of microptimisations, mainly involving trying to stash intermediate values in registers, but the 6502 only has three registers and can only do arithmetic in one, so there’s not much I can do here.

This routine is not called for every pixel in the image. It turns out that Mandelbrot and Julia sets have the property that any area which is completely bounded by a colour contains only that colour. (Because it’s simply connected.) (No, I don’t understand a word of that paper.) So the program recursively traces out the bounds of boxes, and if all sides of the box are the same colour it fills it and doesn’t recurse in. This leads to a massive performance boost, particularly for the dark lake in the middle of the set, which is very slow to render.

; Once zr, zi, cr, ci have been set up, use reenigne's Mandelbrot kernel to
; calculate the colour.
.mandel
    lda #ITERATIONS
    sta iterations
.iterator_loop
    ldy #1              ; indexing with this accesses the high byte 

    ; Calculate zr^2 + zi^2. 

    clc
    lda (zr)            ; A = low(zr^2) 
    tax                 
    adc (zi)            ; A = low(zr^2) + low(zi^2) = low(zr^2 + zi^2) 
    sta zr2_p_zi2_lo
    lda (zr), y         ; A = high(zr^2) 
    adc (zi), y         ; A = high(zr^2) + high(zi^2) = high(zr^2 + zi^2) 
    cmp #4 << (fraction_bits-8)
    bcs bailout
    sta zr2_p_zi2_hi

    ; Calculate zr + zi. 

    clc
    lda zr+0            ; A = low(zr) 
    adc zi+0            ; A = low(zr + zi) 
    sta zr_p_zi+0
    lda zr+1            ; A = high(zr) 
    adc zi+1            ; A = high(zr + zi) + C 
    and #&3F
    ora #&80            ; fixup 
    sta zr_p_zi+1

    ; Calculate zr^2 - zi^2. 

    txa                 ; A = low(zr^2) 
    sec
    sbc (zi)            ; A = low(zr^2 - zi^2) 
    tax
    lda (zr), y         ; A = high(zr^2) 
    sbc (zi), y         ; A = high(zr^2 - zi^2) 
    sta zr2_m_zi2+1 

    ; Calculate zr = (zr^2 - zi^2) + cr. 

    clc
    txa
    adc cr+0            ; A = low(zr^2 - zi^2 + cr) 
    sta zr+0
    lda zr2_m_zi2+1     ; A = high(zr^2 - zi^2) 
    adc cr+1            ; A = high(zr^2 - zi^2 + cr) 
    and #&3F
    ora #&80            ; fixup 
    sta zr+1

    ; Calculate zi' = (zr+zi)^2 - (zr^2 + zi^2). 

    sec
    lda (zr_p_zi)       ; A = low((zr + zi)^2) 
    sbc zr2_p_zi2+0     ; A = low((zr + zi)^2 - (zr^2 + zi^2)) 
    tax
    lda (zr_p_zi), y    ; A = high((zr + zi)^2) 
    sbc zr2_p_zi2+1     ; A = high((zr + zi)^2 - (zr^2 + zi^2)) 
    tay

    ; Calculate zi = zi' + ci. 

    clc
    txa
    adc ci+0
    sta zi+0
    tya
    adc ci+1
    and #&3F
    ora #&80            ; fixup 
    sta zi+1

    dec iterations
    bne iterator_loop

.bailout
    ; iterations contains the pixel colour

Bonus work: Julia sets

For fun, I extended the program to allow it to render Julia sets as well. This uses the same kernel, so it’s very easy.

Mandelbrot and Julia sets both use the same $z' = z^2 + c$ iteration; the difference is that for a Mandelbrot set, I start with $z = c = x + yi$ ; for a Julia set, I start with $z = x + yi$ and $c$ is constant across all pixels. This means that there are infinitely many Julia sets depending on which value of $c$ I pick; and, in fact, if I treat the Julia set’s $c$ as being a point on the Mandelbrot set, I can see an interesting relationship between how the Mandelbrot looks in the vicinity and the shape of the Julia set.

Bogomandel will allow toggling between the Mandelbrot and Julia sets by pressing the space bar; shift plus the cursor keys will adjust the Julia set’s $c$ (and that location will be highlighted on the Mandelbrot set).

Breaking news: better, faster, harder, longer (precision)

After writing the above, I realised that there was a way to use larger precision without needing to remap the lookup tables. With the right MMU setup, the BBC Master has contiguous RAM from about 0x0E00 (where my program is loaded) up to 0xC000. By carefully fiddling the top two bits of my numbers, I could use 0x4000 to 0xC000 as a 32kB lookup table. This allowed the use of 14-bit numbers rather than 13-bit. They look like this:

$\begin{matrix} \text{bit of word} & 15 & 14 & 13 & 12 & 11 & 10 & 9 & 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \\ & Z & n_{13} & n_{12} & n_{11} & n_{10} & n_9 & n_8 & n_7 & n_6 & n_5 & n_4 & n_3 & n_2 & n_1 & n_0 & 0 \end{matrix}$

In order to map the number to an address inside the lookup table, I needed to set $Z$ to $0$ if $n_{13}$ is $1$ , and $1$ if $n_{13}$ is $0$ . In other words, $Z = !n_{13}$ . This made number fixups more complex, and therefore slower — 10.5 cycles rather than the previous 4 cycles — which made the entire program a little bit slower, but I gained the ability to add another zoom level.

I then managed to add another zoom level by adjusting my number representation; for my 14 bit numbers, I was using 5.9, where five bits were used for the integer part and 9 for the fractional part. This would let me represent numbers from -32 to 32. However, that was complete overkill — the top bit is (almost) entirely unused. So I could rearrange things and use 4.10 instead. Now I was only able to represent numbers from -16 to 16, but that’s fine, and the upside is that I was able to represent numbers all the way down to $\frac{1}{2^{10}} = \frac{1}{1024} \approx 0.00099$ . This allowed adding another zoom level.

The ‘almost’ in the above sentence is that in rare circumstances I did need numbers outside of the -16 to 16 range; not being able to represent these led to interesting rendering artifacts…

Rendering artifacts

…but I was able to work around this by clamping out-of-range values in the lookup tables to MAXINT, which prevented the overflow that produced the artifacts.

So, I gained two levels of precision for very little work. Which was nice.