A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text
package.
I took his code, applied it to the text
package, and this is the story of what happened.
The text
package provides a type named Builder
for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial Builder
implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.
In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.
positive :: (Integral a) => a -> Builder
positive = go
where
go n | n < 10 = digit n
| otherwise = go (n `quot` 10) <> digit (n `rem` 10)
digit n = singleton . intTodigit . fromIntegral
Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.
Before I discuss how I improved it, a brief description of how a Builder
works is in order. A Builder
provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable Text
value. External users of Builder
never see the mutable buffers; they can only use safe access methods.
The code above uses the safe API. Each use of singleton
is bounds-checked internally: if a write will not fit into the current buffer, Builder
“finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <>
operator sequences the writes.
While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the show
function, then converting the resulting [Char]
value to a Builder
!
My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.
countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
where go !k v
| v < 10 = k
| v < 100 = k + 1
| v < 1000 = k + 2
| v < 10000 = k + 3
| otherwise = go (k+4) (v `quot` 10000)
This is almost identical to Andrei’s second digits10
function.
Once I was able to count digits, I could use the internal writeN
function to destructively update the buffer. writeN
ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (marr
) and the position to start writing at (off
).
positive :: (Integral a) => a -> Builder
positive i
-- we win when we special-case single-digit numbers
| i < 10 = writeN 1 $ \marr off ->
unsafeWrite marr off (i2w i)
| otherwise = let !n = countDigits i
in writeN n $ \marr off ->
posDecimal marr off n i
posDecimal :: (Integral a) =>
forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
where go off v
| v < 10 = unsafeWrite marr off (i2w v)
| otherwise = do
unsafeWrite marr off (i2w (v `rem` 10))
go (off-1) (v `div` 10)
i2w :: (Integral a) => a -> Word16
i2w v = 48 + fromIntegral v
I then took Andrei’s very clever third digits10
function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).
countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
where
go !k v
| v < 10 = k
| v < 100 = k + 1
| v < 1000 = k + 2
| v < 1000000000000 =
k + if v < 100000000
then if v < 1000000
then if v < 10000
then 3
else 4 + fin v 100000
else 6 + fin v 10000000
else if v < 10000000000
then 8 + fin v 1000000000
else 10 + fin v 100000000000
| otherwise = go (k + 12) (v `quot` 1000000000000)
fin v n = if v >= n then 1 else 0
(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)
While this countDigits
function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.
posDecimal :: (Integral a) =>
forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
where go off v
| v >= 100 = do
let (q, r) = v `quotRem` 100
write2 off r
go (off - 2) q
| v < 10 = unsafeWrite marr off (i2w v)
| otherwise = write2 off v
write2 off i0 = do
let i = fromIntegral i0; j = i + i
unsafeWrite marr off $ get (j + 1)
unsafeWrite marr (off - 1) $ get j
get = fromIntegral . B.unsafeIndex digits
A final surprise came when I decided to try an experiment: what if I replaced the separate uses of quot
and rem
with a single use of quotRem
? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because quotRem
can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.
Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of v+v
instead of v*2
above, where I’m strength-reducing operations by hand instead of trusting the compiler.)
I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.
The end result of all this tweaking is that the new number renderer in the text
package is not just much faster than its predecessor, it is also a lot faster than the venerable show
function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!
I suppose that “Access denied” – https://docs.google.com/file/d/0B_CUoZk6rWOIejN5NDZYRmxGcTA/edit isn’t the right picture?
Oops, fixed. Thanks!
New primops were created for quotRem as a solution for GHC issue 5598 with duplicate divisions. So, if quot and rem are used separately, there will be still two primops and two calculations which might be joined by LLVM, but not by NCG.
This may be overrated, depending on how log performs but you can count digits this way :
baseLog = log 10 — computed once and reused
countDigits :: (Integral a) => a -> Int
countDigits x = log x/baseLog + 1