Yesterday I suggested that the reason my old Haskell-based Fibonacci calculator program was so fast might be that Haskell was displaying the result before it was fully calculated. Now, the typical software development experience involves working with finite-sized integers (e.g. 16-bit, 32-bit, 64-bit) and floating-point numbers. Almost without exception, the operations that a program performs on these types of values today are implemented entirely in the processor, so they are essentially atomic – one moment all you have is two operands and the next you have the result. There is no in-between period where you have “part” of the result. Consequently, the idea that you might have one part of a numeric result and not another may be somewhat counter-intuitive. But…
Most of the math that we perform by hand is algorithmic. What I mean here is that we don’t simply multiply two large integers – we follow an algorithm that instructs us to multiply the individual digits and then add the results in a particular way. For example, suppose we multiply 521 by 482 to get 251122. Elementary-school multiplication has us perform the following steps:
1) designate one number the multiplicand and the other the multiplier (I’ll choose 521 as the multiplicand and 482 as the multiplier)
2) choose the lowest-order digit of the multiplier and note its magnitude (i.e. the power of ten corresponding to its position)
3) multiply each digit of the multiplicand by the power of ten corresponding to its place and by the multiplier digit just selected, starting with the least significant digit of the multiplicand and working up, summing the results as we go; finally, multiply that sum by the magnitude of the multiplier digit
4) repeat steps two and three for each of the remaining digits of the multiplier and their corresponding magnitudes
5) finally, sum these partial products to obtain our result: 521 × 2 = 1042, 521 × 8 × 10 = 41680, 521 × 4 × 100 = 208400, and 1042 + 41680 + 208400 = 251122
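The steps above translate almost directly into code. Here is a minimal sketch in Haskell (the names `Digits`, `toDigits`, `normalize`, and `longMultiply` are my own, not from any library):

```haskell
-- A direct transcription of the grade-school algorithm. Numbers are
-- lists of decimal digits, least significant first, so 521 is [1,2,5].

type Digits = [Int]

toDigits :: Integer -> Digits
toDigits 0 = [0]
toDigits n = go n
  where
    go 0 = []
    go m = fromIntegral (m `mod` 10) : go (m `div` 10)

fromDigits :: Digits -> Integer
fromDigits = foldr (\d acc -> fromIntegral d + 10 * acc) 0

-- Propagate carries so that every digit ends up in 0..9.
normalize :: Digits -> Digits
normalize = go 0
  where
    go 0 []       = []
    go c []       = go 0 [c]            -- flush any remaining carry
    go c (d : ds) = let s = d + c in s `mod` 10 : go (s `div` 10) ds

-- One partial product per multiplier digit, shifted by that digit's
-- magnitude (steps 2-4), then summed and carried (step 5).
longMultiply :: Digits -> Digits -> Digits
longMultiply multiplicand multiplier =
  normalize (foldl addShifted [] (zip [0 ..] multiplier))
  where
    addShifted acc (shift, d) =
      addDigits acc (replicate shift 0 ++ map (* d) multiplicand)
    addDigits (x : xs) (y : ys) = x + y : addDigits xs ys
    addDigits xs []             = xs
    addDigits [] ys             = ys
```

Notice that `longMultiply` sums all of the partial products before resolving any carries – no digit of the result is available until the whole computation finishes, which is exactly the point this post is circling.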
The math that the processor performs on our behalf for finite integers uses the same algorithm to produce its result (only the computations are in base 2 instead of base 10). What I mean here is that while the processor does all of the work for us, even at the silicon level the result is actually computed in stages. These stages might simply be cascades of gates, but there is still a series of “steps” or “stages” involved in generating the number.
When a computer program performs arbitrary-precision multiplication (aka BigInteger math), it uses the same algorithm, or one of a few more efficient alternatives, to compute the result. But what if we could get part of our result early? Perhaps we could take advantage of a few extra processor cores and start working with it right away.
Well, it turns out that we can, in most cases, get part of our result much earlier. All we have to do is change our order of operations. The typical grade-school version of the multiplication algorithm is devised for simplicity. The way I’ve written it above may make it look a bit more complex, but the order of operations is roughly the same, and I hope the extra detail makes the reordering easier to understand. So let’s look at a different way of computing this result.
First we would like to get an idea of the magnitude of our result. We already know that it has to be expressible in six digits, since we’re multiplying two three-digit numbers, but we would like to know the most significant digit because that will get us a lot closer. That value is simply the highest-order digit of the product of the two highest-order digits, plus any “carry” digits. In our case, 4 × 5 = 20 (that is, 400 × 500 = 200000) and, even after we factor in the carry from subsequent operations, it is pretty easy to demonstrate that, since the least significant digit of 20 is zero, the addition of carry digits could never change our most significant digit: 2.
For the next digit, though, we’ll need the sum of any carry digits from lower-order operations. Our next digit is the least significant digit from our first calculation – the 0 of 20 (it’s worth pointing out that our previous calculation may have had only one digit, in which case we’d have skipped to this step) – plus the most significant digit of the sum of our next-lower-order calculations (4 × 2 + 8 × 5 = 48, whose most significant digit is 4) plus any carry from lower-order operations.
Since our second digit is non-zero, we need to consider carry, so we have to perform the next operation: 4 × 1 + 8 × 2 + 2 × 5 = 30, which contributes 3000.
Now we know that we don’t need to worry about carry from the next operation (the least significant digit of 30 is zero, so no lower-order carry can ripple past it), which means we can finish computing the previous result: 48 + 3 = 51.
Now there is still a chance that our next operation will yield a result that affects our least significant digit of 1 (the 1 of 51), but we can now reliably say that our next most significant digit will be 5. So now we’ll compute the next value: 8 × 1 + 2 × 2 = 12, which contributes 120.
Now there is no chance that adding this, plus any carry, could affect our 1000s digit, so we can safely say that our next most significant digit is the 1 from our calculation above. Our result so far is therefore 251000. All we have to do is finish our final calculation, which is simply: 2 × 1 = 2.
Adding this to the 120 above, we get 122, which we can add to our result in progress to get 251122.
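The reordered walkthrough can be sketched the same way. In this sketch (`columnSums` and `partialSums` are my own names), each “column” collects the digit products that share a power of ten, and folding the columns in from the most significant end shows the leading digits stabilizing before the low-order work is done:

```haskell
-- Digits are stored least significant first, so 521 is [1,2,5].
-- Column k collects every digit product a_i * b_j with i + j == k;
-- its weight is 10^k. For 521 * 482 the columns, highest first,
-- are [20,48,30,12,2].
columnSums :: [Int] -> [Int] -> [Int]
columnSums a b =
  [ sum [ x * y | (i, x) <- zip [0 ..] a
                , (j, y) <- zip [0 ..] b
                , i + j == k ]
  | k <- [hi, hi - 1 .. 0] ]
  where hi = length a + length b - 2

-- Fold the columns in from the most significant end. Each partial sum
-- is the exact value of the columns consumed so far; the leading
-- digits stop changing as soon as the remaining columns are too small
-- to carry that far.
partialSums :: [Int] -> [Integer]
partialSums = scanl (\acc c -> acc * 10 + fromIntegral c) 0

-- partialSums (columnSums [1,2,5] [2,8,4])
--   == [0, 20, 248, 2510, 25112, 251122]
-- After three columns the prefix 251 is already settled.
```

The successive partial sums 20, 248, 2510, 25112, 251122 are exactly the stages of the walkthrough above: the 2 appears first, then the 5, then the 1, long before the final 122 is known.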
Now, yes, I’ve been fast and loose with the operations and I haven’t really provided a formal algorithm, but I’m sure you can see the general framework. The more important point here is that I was able to work out the most significant digits early in the operation. While this isn’t particularly helpful when dealing with calculations involving six-digit results, it can be crucial when working with arbitrary-precision results involving hundreds of millions of digits. In those cases, knowing the most significant digits can enable you to start reporting the result before you’ve finished calculating it, or simply to determine that it is outside the bounds of the problem set, etc.
There are two keys to taking advantage of this. The first is structuring calculations in a manner like the one above, so that the higher-order computations required to determine the most significant “digit”(s) occur first. The second is structuring the evaluation so that these results are available to “callers” as soon as they are known. In a traditional language, this might involve some relatively exotic programming, so this is where we go back to Haskell and functional languages in general.
Haskell performs computations using “non-strict” evaluation. This basically means that if you never request a result, it is never computed. Equally importantly, when you do request a value, the calculations needed to compute it are performed at that moment. In a very real way, it’s as though a Haskell program, when run, simply generated a set of instructions for computing the result and returned these to the caller. When the caller decided to report the result, he would read the recipe and perform the operations it contained to compute the result.
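As a small illustration of this (the classic lazily defined Fibonacci stream, not the arbitrary-precision machinery itself): the list below is conceptually infinite, yet asking for a prefix forces only the work needed to produce that prefix – the rest remains an unevaluated recipe.

```haskell
-- An infinite, lazily constructed list of Fibonacci numbers. Nothing
-- is computed when fibs is defined; each element is evaluated only
-- when some caller actually demands it.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

-- take 10 fibs evaluates exactly ten elements and leaves the
-- (infinite) tail untouched:
-- take 10 fibs == [0,1,1,2,3,5,8,13,21,34]
```

A caller that only prints the first few elements never pays for the rest – the same property we would want for the leading digits of a large product.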
Since BigInteger math consists of working with numbers that are simply strings of digits in, say, base 4294967296 (that is, 2^32), the same basic algorithm described above could be applied to yield the most significant “digits” early. These results could go directly to the next calculation, or to a base-10 conversion so that they can be displayed in human-readable form. If Haskell’s base-10 conversion were built to work with the higher-order digits first, then we might very well see the behavior I first described – the result still being calculated as it was displayed.
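To make “digits in base 4294967296” concrete, here is a small sketch (`toBase` and `fromBase` are my own names, not a real BigInteger API): splitting an Integer into digits of an arbitrary base. With base 2^32, these digits are the machine words a typical BigInteger representation stores, and the grade-school algorithm works on them unchanged.

```haskell
-- Split a non-negative Integer into "digits" of an arbitrary base,
-- least significant first. With base = 2^32 these digits are the
-- 32-bit words a typical BigInteger implementation stores.
toBase :: Integer -> Integer -> [Integer]
toBase _ 0 = [0]
toBase b n = go n
  where
    go 0 = []
    go m = m `mod` b : go (m `div` b)

-- The inverse: reassemble the value from its digits.
fromBase :: Integer -> [Integer] -> Integer
fromBase b = foldr (\d acc -> d + b * acc) 0

-- toBase 4294967296 (2 ^ 64 + 1) == [1, 0, 1]
-- i.e. 2^64 + 1 is written "101" in base 2^32.
```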
In practice, Haskell may not perform this level of “pipelining” for arbitrary-precision math operations, and it is quite possible that BigInteger results must be computed fully before they become available for subsequent operations. My point here isn’t so much to say what Haskell does as to point out what is possible – particularly as it relates to taking advantage of more processor cores. It is possible that Haskell’s performance has somewhat more mundane roots. To that point, next time I’ll take a look at the algorithm that I think Haskell currently uses for base-10 conversion and how it fits into this picture.