This page discusses the basics of BigNum arithmetic, which refers to high-precision arithmetic using more digits than the processor's native architecture supports directly.
Some of the content refers to older obsolete techniques that were in use prior to the migration to 64-bit code, but is kept here for historical purposes. Also see the page on floating-point arithmetic.
The first section is a general discussion of the basics of bignum arithmetic, including Karatsuba multiplication. At this point, with only about 500 bits at most in use, I don't have any need for more advanced methods like Toom-Cook or Schönhage-Strassen, or for Fourier-transform methods, and they are not discussed.
The second section describes some details of how I have implemented bignum math in my software and provides some speed benchmarks.
The final two sections give performance data on the Pentium-IV and Core2 architectures.
I hope this page will help any aspiring young software engineers that might be working on their own bignum arithmetic functions. This area of computer science doesn't have many practical applications, but it is a lot of fun and a great exercise to hone your programming skills (not to mention the fact it's a pretty common college computer science class project). Plus, once you've got it working, it is really cool to have a program that can multiply thousand-digit numbers or calculate the millionth digit of pi.
Bignum arithmetic refers to software algorithms that perform the fundamental arithmetic operations of addition, subtraction, multiplication, and division on numbers with more digits than the hardware of the machine they are running on can support natively. These algorithms are an interesting and important branch of computer science in their own right, independent of their application to fractal art.
The Pentium processors (and the closely related Core, Core2, and Athlon processors) all have built-in math units that can very rapidly perform addition, subtraction, multiplication, division, and more complex operations like square roots, logarithms, and trigonometric functions. But the built-in hardware units are limited to 52 bits of precision, which is equivalent to about 16 decimal digits (15.65, to be exact):
Digits = log10(2^52) = 52 × log10(2) = 15.653559774527022151114422525674...
That is nowhere near enough to allow zooming in to 10^50 or 10^100. The only way to achieve the vastly greater precision required for animations that zoom to depths like that is by using software arithmetic operations. That means literally writing functions to perform addition, subtraction, multiplication, division, etc. for very large numbers.
For many years, prior to the move to 64-bit code, I used SSE2 instructions for my multiplication kernel operations. The SSE2 instruction set features some extremely fast instructions that operate on multiple pieces of data at once, but they have some limitations. In particular, they have no way to propagate carries, so rather complicated carry-saving techniques have to be used (this is how many hardware multipliers work internally). In the end, the added speed was worth the extra complications, but the details are outside the scope of general bignum operations and won't be covered here, especially since this code is now obsolete after the migration to 64-bit arithmetic functions.
High-precision numbers in my software are stored as sequences of unsigned 32-bit "digits" (referred to as "limbs" in the GNU math library). This remains the case even after the migration to 64-bit code. From the perspective of the C++ code, each digit is a 32-bit value, but the assembly language code works with them in groups of two.
Numbers are stored in a fixed-point format with one integer digit. The smallest precision implemented in my software is four 32-bit digits, which means one integer digit and three 32-bit digits to the right of the decimal point. Precision can be increased in chunks of two 32-bit digits. Previously, I had to work in even-numbered chunks like this because of the way the SSE2 instruction set works; now it is because the 64-bit instructions work on groups of two 32-bit digits. In the more general case, without 64-bit instructions, precision could be added one digit at a time.
All digits are unsigned. A separate sign bit is used.
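To make the storage layout concrete, here is a minimal C++ sketch of a fixed-point bignum along these lines. These are not the actual types used in the software; the names and the least-significant-first digit order are arbitrary choices for the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the storage layout described above: a separate sign
// flag plus a fixed-point magnitude made of unsigned 32-bit digits. Digits are
// stored least-significant first here, so the last digit is the integer part
// and the rest form the fraction.
struct BigFixed
{
    bool negative = false;             // separate sign; the digits themselves are unsigned
    std::vector<uint32_t> digits;      // digits.back() = integer part, the rest = fraction

    explicit BigFixed(std::size_t ndigits) : digits(ndigits, 0) {}
};

// Example: the smallest supported precision, four 32-bit digits
// (one integer digit plus three fractional digits = 96 fractional bits).
// BigFixed x(4);
```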
Traditional addition works by adding digits consecutively starting from the rightmost (least-significant) digit and proceeding to the left, propagating the carries that are generated from each individual digit sum as you go.
For example, suppose you want to add 3867 + 1946. You start with 7+6 and proceed like this:
| Carry: | 1 | 1 | 1 |   |
|---|---|---|---|---|
|   | 3 | 8 | 6 | 7 |
| + | 1 | 9 | 4 | 6 |
| Sum: | 5 | 8 | 1 | 3 |
Carries are propagated upward each time two digits are added. At most, only one bit of carry can be generated each time two digits are added. The x86 ADD and ADC instructions work well for this, and addition can be done with a simple loop.
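Here is a short C++ sketch of that loop, essentially the high-level equivalent of an ADD/ADC chain. The real implementation is in assembly; this version is purely illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Multi-digit addition with carry propagation. a, b, and sum each hold n
// 32-bit digits, stored least-significant digit first. Returns the carry out
// of the most significant digit.
uint32_t add_digits(const uint32_t* a, const uint32_t* b, uint32_t* sum, std::size_t n)
{
    uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;  // at most one bit of carry per digit
        sum[i] = (uint32_t)t;                        // keep the low 32 bits
        carry  = t >> 32;                            // carry into the next digit
    }
    return (uint32_t)carry;
}
```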
Multiplication is a bit more complicated than addition. To multiply two N-digit numbers, we have to form N × N = N^2 products of pairs of individual digits, then perform a whole bunch of additions.
For example, consider 3867 x 1946. When we calculate this on paper, we break the problem down into (3867x6) + (3867x4)x10 + (3867x9)x100 + (3867x1)x1000.
|   |   |   |   | 3 | 8 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| × |   |   |   | 1 | 9 | 4 | 6 |
| 3867×6: |   |   | 2 | 3 | 2 | 0 | 2 |
| 3867×4: |   | 1 | 5 | 4 | 6 | 8 |   |
| 3867×9: | 3 | 4 | 8 | 0 | 3 |   |   |
| 3867×1: | 3 | 8 | 6 | 7 |   |   |   |
| Sum: | 7 | 5 | 2 | 5 | 1 | 8 | 2 |
For a fixed-point multiplication system like I use, where the first digit represents the integer part and the next three digits represent the fraction, this example shows how to multiply 3.867 x 1.946, giving a result of 7.525182.
Note that the product has twice as many digits as the original numbers did. For some purposes, it makes sense to save some of them or all of them, but I don't do that. I truncate each multiplication product so there are always the same number of digits in the product as there were in the original numbers. So the final answer here would be 7525, corresponding to 7.525.
That truncation means that some of the digit products don't need to be calculated. For example, we don't use the 2 that comes from the product of 6x7=42 in the rightmost place. We have to be careful, because there is a carry (the 4) from that digit product into the next place, and that can affect the final answer. Careful analysis of this shows that you can drop some digit products and not affect the final answer by more than 1 bit. I am not going to reproduce that analysis here, but just show the final result. To see this better, we need to rearrange the multiplication operation above into a square matrix:
|   | 3 | 8 | 6 | 7 |
|---|---|---|---|---|
| 1 | 1×3 | 1×8 | 1×6 | 1×7 |
| 9 | 9×3 | 9×8 | 9×6 | 9×7 |
| 4 | 4×3 | 4×8 | 4×6 | 4×7 |
| 6 | 6×3 | 6×8 | 6×6 | 6×7 |
The multiplications in the lower-right portion of the grid (shown dimmed in the original coloring) can safely be skipped without affecting the final truncated product by more than one least-significant bit.
Here is a larger table showing the 8x8 case, where an 8-digit number A = a7a6...a1a0 is multiplied by an 8-digit number B = b7b6...b1b0:
|   | a7 | a6 | a5 | a4 | a3 | a2 | a1 | a0 |
|---|---|---|---|---|---|---|---|---|
| b7 | b7a7 | b7a6 | b7a5 | b7a4 | b7a3 | b7a2 | b7a1 | b7a0 |
| b6 | b6a7 | b6a6 | b6a5 | b6a4 | b6a3 | b6a2 | b6a1 | b6a0 |
| b5 | b5a7 | b5a6 | b5a5 | b5a4 | b5a3 | b5a2 | b5a1 | b5a0 |
| b4 | b4a7 | b4a6 | b4a5 | b4a4 | b4a3 | b4a2 | b4a1 | b4a0 |
| b3 | b3a7 | b3a6 | b3a5 | b3a4 | b3a3 | b3a2 | b3a1 | b3a0 |
| b2 | b2a7 | b2a6 | b2a5 | b2a4 | b2a3 | b2a2 | b2a1 | b2a0 |
| b1 | b1a7 | b1a6 | b1a5 | b1a4 | b1a3 | b1a2 | b1a1 | b1a0 |
| b0 | b0a7 | b0a6 | b0a5 | b0a4 | b0a3 | b0a2 | b0a1 | b0a0 |
In general, as the number of digits increases, we save almost half the multiplications.
To calculate exactly how many digit products must still be computed, let N be the number of digits in each number we are multiplying. The number of digit products on the diagonal and in the upper-left half is N(N+1)/2, and the number of digit products kept from the subdiagonal is N-2, so the total is N(N+1)/2 + (N-2) = (N^2 + 3N - 4)/2.
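Here is a C++ sketch of the truncated schoolbook multiplication described above. It is not the production assembly code, and for simplicity it keeps the entire subdiagonal (everything with i + j >= N - 2 in least-significant-first indexing), which is slightly more conservative than the count just given; the result can still differ from the exactly rounded truncated product by about one least-significant bit.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Truncated schoolbook multiplication. a and b each hold n 32-bit digits,
// least-significant digit first; returns the top n digits of the 2n-digit
// product, i.e. the fixed-point result when each input has one integer digit.
std::vector<uint32_t> mul_truncated(const std::vector<uint32_t>& a,
                                    const std::vector<uint32_t>& b)
{
    const std::size_t n = a.size();
    std::vector<uint64_t> acc(2 * n, 0);             // one accumulator per output column

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i + j + 2 < n) continue;             // product affects only discarded digits
            uint64_t p = (uint64_t)a[i] * b[j];
            acc[i + j]     += (uint32_t)p;           // low half of the digit product
            acc[i + j + 1] += p >> 32;               // high half
        }
    }

    for (std::size_t k = 0; k + 1 < 2 * n; ++k) {    // propagate carries upward
        acc[k + 1] += acc[k] >> 32;
        acc[k] &= 0xFFFFFFFFu;
    }

    std::vector<uint32_t> r(n);                      // keep only the top n digits
    for (std::size_t k = 0; k < n; ++k)
        r[k] = (uint32_t)acc[n + k];
    return r;
}
```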
If A=B in the multiplication grid above, then all the digit products in the lower-left half of the grid are duplicates of the products in the upper right, since, for example, a4a6 = a6a4. The following table shows the squaring grid; the additional digit products that can be eliminated are exactly those duplicates below the main diagonal (dimmed in a different color in the original). For large values of N, the final result is that we are doing about one quarter as many digit multiplications as the full N^2. Unlike the previous saving, taking advantage of this symmetry when squaring numbers introduces no loss of precision.
|   | a7 | a6 | a5 | a4 | a3 | a2 | a1 | a0 |
|---|---|---|---|---|---|---|---|---|
| a7 | a7a7 | a7a6 | a7a5 | a7a4 | a7a3 | a7a2 | a7a1 | a7a0 |
| a6 | a6a7 | a6a6 | a6a5 | a6a4 | a6a3 | a6a2 | a6a1 | a6a0 |
| a5 | a5a7 | a5a6 | a5a5 | a5a4 | a5a3 | a5a2 | a5a1 | a5a0 |
| a4 | a4a7 | a4a6 | a4a5 | a4a4 | a4a3 | a4a2 | a4a1 | a4a0 |
| a3 | a3a7 | a3a6 | a3a5 | a3a4 | a3a3 | a3a2 | a3a1 | a3a0 |
| a2 | a2a7 | a2a6 | a2a5 | a2a4 | a2a3 | a2a2 | a2a1 | a2a0 |
| a1 | a1a7 | a1a6 | a1a5 | a1a4 | a1a3 | a1a2 | a1a1 | a1a0 |
| a0 | a0a7 | a0a6 | a0a5 | a0a4 | a0a3 | a0a2 | a0a1 | a0a0 |
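The following sketch shows one way to exploit that symmetry: each off-diagonal product is computed once and accumulated twice. For brevity it produces the full 2N-digit square and omits the truncation described earlier; it is an illustration of the idea, not the code used in the software.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Squaring using the symmetry a_i*a_j == a_j*a_i. Input digits are stored
// least-significant first; the full 2n-digit square is returned.
std::vector<uint32_t> square_digits(const std::vector<uint32_t>& a)
{
    const std::size_t n = a.size();
    std::vector<uint64_t> acc(2 * n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            uint64_t p = (uint64_t)a[i] * a[j];
            uint32_t lo = (uint32_t)p;
            uint32_t hi = (uint32_t)(p >> 32);
            int times = (i == j) ? 1 : 2;            // off-diagonal products count twice
            for (int t = 0; t < times; ++t) {
                acc[i + j]     += lo;
                acc[i + j + 1] += hi;
            }
        }
    }

    for (std::size_t k = 0; k + 1 < 2 * n; ++k) {    // propagate carries upward
        acc[k + 1] += acc[k] >> 32;
        acc[k] &= 0xFFFFFFFFu;
    }

    std::vector<uint32_t> r(2 * n);
    for (std::size_t k = 0; k < 2 * n; ++k)
        r[k] = (uint32_t)acc[k];
    return r;
}
```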
Karatsuba multiplication utilizes a clever identity to multiply two 2-digit numbers using only three multiplications. To see how it works, we'll have to use some more formal notation.
Let e represent the base of the number system we're using. For normal decimal numbers, e=10. For a system in a computer that uses one byte for each digit, e=256. For my system, which uses a 32-bit unsigned integer for each digit, e = 2^32 = 4294967296.
Now let A = a1*e + a0, and B=b1*e + b0. (I'm using * for multiplication here).
To form the product of A*B, the traditional way, we calculate
[Eq1] A*B = (a1*b1)*e^2 + (a1*b0 + a0*b1)*e^1 + a0*b0
This takes four multiplications.
The Karatsuba trick is based on the following fact (just write it out on paper and you will see it's true):
[Eq2] (a1+a0)*(b1+b0)-a1*b1-a0*b0 = a1*b0+a0*b1
The right-hand side of Eq2 is exactly the same as the middle term in Eq1 (the part multiplying e^1), so we can rewrite Eq1 by substituting the left-hand side of Eq2:
[Eq3] A*B = (a1*b1)*e^2 + [(a1+a0)*(b1+b0) - a1*b1 - a0*b0]*e^1 + a0*b0
A simpler way to write this is:
[Eq4] A*B = X*e^2 + [U-X-Y]*e^1 + Y,
where X=a1*b1, Y=a0*b0, and U=(a1+a0)*(b1+b0)
The three multiplications required are just those needed to calculate X, Y, and U. (There is a subtle implementation issue involved in calculating U...See below.)
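For concreteness, here is a C++ sketch of one level of Karatsuba on a pair of two-limb numbers in base e = 2^32, following Eq4. This is only an illustration: it sidesteps the carry problem in U (discussed below) by using unsigned __int128, a GCC/Clang extension, rather than the separate carry flags used in the actual implementation.

```cpp
#include <cstdint>

// A = a1*2^32 + a0 and B = b1*2^32 + b0; the 4-limb product goes in r[0..3],
// least-significant limb first.
void karatsuba2(uint32_t a1, uint32_t a0, uint32_t b1, uint32_t b0, uint32_t r[4])
{
    uint64_t X = (uint64_t)a1 * b1;                     // X = a1*b1
    uint64_t Y = (uint64_t)a0 * b0;                     // Y = a0*b0
    uint64_t sa = (uint64_t)a1 + a0;                    // may be 33 bits wide
    uint64_t sb = (uint64_t)b1 + b0;                    // may be 33 bits wide
    unsigned __int128 U = (unsigned __int128)sa * sb;   // U = (a1+a0)*(b1+b0), up to 66 bits
    unsigned __int128 mid = U - X - Y;                  // equals a1*b0 + a0*b1 by Eq2

    unsigned __int128 result = ((unsigned __int128)X << 64) + (mid << 32) + Y;
    r[0] = (uint32_t)result;
    r[1] = (uint32_t)(result >> 32);
    r[2] = (uint32_t)(result >> 64);
    r[3] = (uint32_t)(result >> 96);
}
```

Only three 32x32-bit multiplications appear (for X, Y, and U); everything else is addition, subtraction, and shifting.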
For large numbers of digits, this method can be applied recursively. The savings can be substantial -- the number of multiplications is reduced from N^2 to about N^1.6 for very large values of N.
For small numbers of digits, the overhead of the extra additions can outweigh the savings from the reduced number of multiplications. In fact, in my arithmetic functions at N=16, the Karatsuba method was faster than my earlier conventional multiplication, which did not yet truncate the less-significant digit products. But after introducing the optimizations described above, the Karatsuba method is no longer faster, even for N=16. (N=16 means 15 fractional 32-bit digits plus one integer digit, for a total of 480 fractional bits, or about 144 decimal digits after the decimal point.)
One significant problem is somewhat subtle, though perhaps some of the more experienced software engineers spotted it in the equations above. The calculation of U requires first performing two additions, which means carries are possible. If each digit is 32 bits, there is nowhere for those carries to go, so you either have to reduce the digit size to 31 bits or keep track of the carries separately. Neither approach is cheap or easy. My Karatsuba implementation kept a carry flag for each sum and then added the corresponding terms back in after the multiplication, as needed.
Another problem for my operations is that a subtraction is required. This is not a big deal usually, but with the SSE2 instructions and their lack of carry or borrow, it creates some major logistical hassles in keeping track of things.
The need for high-precision arithmetic is driven by the need to take differences between numbers that are close together. It is not so much a matter of handling very large or very small numbers as of distinguishing relatively small numbers that are extremely close together.
Once a bignum arithmetic function is written, it is easy to have it run for any number of digits -- but how many is enough? The time taken to perform multiplications increases as the square of the number of digits, so we don't want to use more digits of precision than we really need given the magnification of the image we're trying to render. So we should think about how much precision is needed and use as few digits as possible to adequately render the image.
The most obvious effect of limited precision shows up when calculating the coordinates in the complex number plane that correspond to the pixel numbers in the image. This is done as follows:
X = Nx * s + X0
Y = Ny * s + Y0
where
Nx = x-coordinate pixel number
Ny = y-coordinate pixel number
s = spacing of pixels (square!) in the complex plane
X0 = left coordinate of image
Y0 = top coordinate of image
We start running out of precision when s, the spacing of pixels in the complex plane, gets really small compared to X0 or Y0. But how small can it get before we need to add more digits?
Let d be the smallest number that can be represented by one least-significant bit in the number system in use. For the IEEE-754 double-precision numbers, that is d = 2^-52 = 2.22e-16. This is the smallest number that can be added to 1.000... that will yield a result different from 1.000... The numbers that can be exactly represented with standard double-precision quantities are integer multiples of this small value.
If we want to be sure that our X and Y values are accurately located on the complex plane, we need to consider how large the deviation of any number can be from the closest exactly-representable number. Since the spacing between the exactly-representable numbers is d, we can never be more than d/2 away from any arbitrary real number.
The spacing of the pixels in the complex plane is s, and we want to be able to represent that spacing sufficiently accurately that we do not see any noticeable jitter in the positions of the pixels in the image. Let E be the maximum proportional error we are willing to tolerate in pixel location (e.g. maybe 1% or 0.1% or something like that). Since the largest absolute error possible is d/2, we have:
(d/2) / s < E
Now, we can express s in terms of more meaningful quantities like this:
s = W / N
where
W = size of image in the complex plane (determined by whichever direction is larger, usually the X-direction)
N = number of pixels in the larger direction (again, usually the X-direction)
and this leads to
W > (d N) / (2 E)
This gives the minimum size W of an image for a precision d and pixel resolution N, given our choice of relative error E.
For a 1% error (E=0.01) in a 640-pixel wide image in IEEE-754 double-precision format, we get:
W = (2.22e-16 * 640) / (2 * 0.01) = 7.1e-12
This is the smallest size image that we can draw and maintain a maximum 1% error in pixel location. If the image gets smaller than that, we have to switch to high-precision arithmetic to maintain better than 1% maximum error.
The human eye is a little more tolerant of pixel-positioning error in still images than in animations, because the eye (the brain, really) is very good at noticing even tiny amounts of temporal jitter in the locations of the pixels from one frame to the next. For animations, generally any error larger than 0.1% has a reasonable chance of being noticed. I usually switch to the next higher level of precision when the increment s becomes less than 1000 times the d-value for the current precision, which means the total image width is 320,000 times the working precision's d-value.
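The sketch below applies that rule of thumb to pick a precision level. It is a hypothetical helper, not code from the actual program: it simply finds the smallest number of fractional 32-bit digits for which the pixel spacing s is at least 1000 times the working precision's d-value.

```cpp
#include <cmath>
#include <cstdio>

// Smallest number of fractional 32-bit digits such that 2^-(32*digits) <= s/1000.
int fractional_digits_needed(double pixel_spacing_s)
{
    double bits_needed = std::log2(1000.0 / pixel_spacing_s);
    int digits = (int)std::ceil(bits_needed / 32.0);
    if (digits < 3) digits = 3;      // smallest format: three fractional digits (96 bits)
    return digits;
}

int main()
{
    // Example: an image 1e-30 wide across 640 pixels.
    double s = 1e-30 / 640.0;
    std::printf("fractional 32-bit digits needed: %d\n", fractional_digits_needed(s));
    return 0;
}
```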
Addition and subtraction scale linearly with the number of digits, so they are always much faster than multiplication. I currently implement them with regular general-purpose assembly language instructions. The SSE2 instructions do not support a carry, so they are not well-suited to multi-digit addition operations.
For fractals that do not require high-precision division, which includes the divergent (escape-time) types like the Mandelbrot set, a good machine-code division function is not needed. Currently I use a simple binary-search algorithm to take the reciprocal of a bignum that must be greater than 1, from which a division function can be built easily.
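The following sketch illustrates that binary-search reciprocal, with ordinary doubles standing in for bignums just to show the idea: for x > 1 the reciprocal lies in (0, 1), each step halves the interval, and only multiplication and comparison are needed.

```cpp
#include <cstdio>

// Reciprocal of x (> 1) by bisection; produces roughly one result bit per iteration.
double reciprocal_by_bisection(double x, int bits)
{
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < bits; ++i) {
        double mid = 0.5 * (lo + hi);
        if (mid * x <= 1.0)
            lo = mid;                // mid is still at or below 1/x
        else
            hi = mid;                // mid overshot 1/x
    }
    return lo;
}

int main()
{
    std::printf("1/3 ~= %.15f\n", reciprocal_by_bisection(3.0, 52));
    return 0;
}
```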
Convergent fractals do require division, and that means that floating-point arithmetic is necessary to properly calculate them. Floating point was added in January 2011, and with that came an assembly-language division function, since that was the whole point of adding floating point in the first place.
Performance data are shown in the table below for various levels of precision.
Update Dec 2008: Improvements in the low-level assembly language code for the SSE-based multiplication have given speed improvements up to almost 50%. The table below shows the new speeds and the old speeds on the same 3.2 GHz Pentium IV.
Explanation of table data: each row gives the total precision in bits, the number of fractional bits, the equivalent number of decimal digits in the fractional part, the rate of bignum multiplications in millions per second, and the rate of Mandelbrot operations in thousands per second, for both the new and the old code.
Total Bits | Fractional Bits | Fractional Part Decimal Digits | Multiplications (New), Millions/second | Multiplications (Old), Millions/second | Mandelbrot Operations (New), Thousands/second | Mandelbrot Operations (Old), Thousands/second
---|---|---|---|---|---|---
128 | 96 | 28.89 | 11.64 | 10.64 | 1768 | 1738 |
192 | 160 | 48.16 | 8.10 | 7.14 | 1531 | 1488 |
256 | 224 | 67.43 | 5.93 | 4.57 | 1205 | 1067 |
320 | 288 | 86.70 | 3.83 | 2.91 | 979 | 771 |
384 | 352 | 105.96 | 3.08 | 2.29 | 840 | 609 |
448 | 416 | 125.23 | 2.43 | 1.78 | 722 | 488 |
512 | 480 | 144.49 | 2.03 | 1.45 | 573 | 388 |
Updated Benchmark: Now this software can draw a 640x480 image of the interior of the Mandelbrot set (centered at 0+0i in the complex number plane) with 1000 iterations at 106 digits of precision (352 bits of fraction) in 6.1 minutes on a 3.2 GHz Pentium IV [previously 8.4 minutes].
The Core2 processor is touted as having an intrinsically faster architecture than the Pentium-IV. How does a single core of the quad-core Core2 running at 2.6 GHz compare to the Pentium-IV running at 3.2 GHz? Here are the results:
Total Bits | Fractional Bits | Fractional Part Decimal Digits | Multiplications, Millions/second | Mandelbrot Operations, Thousands/second |
---|---|---|---|---|
128 | 96 | 28.89 | 21.32 | 1734 |
192 | 160 | 48.16 | 11.86 | 1491 |
256 | 224 | 67.43 | 8.31 | 1311 |
320 | 288 | 86.70 | 5.71 | 1086 |
384 | 352 | 105.96 | 4.44 | 953 |
448 | 416 | 125.23 | 3.49 | 829 |
512 | 480 | 144.49 | 2.87 | 643 |
Overall, the basic multiplication is significantly faster on the Core2. The Mandelbrot operation numbers are a tiny bit slower on the Core2 for lower precisions, which is probably sampling error. There is a small but noticeable increase in speed at all precisions starting at 256 total bits.