The other night, I had a random whim to spend a couple of minutes looking at the performance of UTF-8 decoding in the Haskell Unicode text package. Rather than measure the performance directly, what I did was use Don Stewart's excellent ghc-core tool to inspect the high-level "Core" code generated by the compiler. Core is the last layer at which Haskell code is still somewhat intelligible, and although it takes quite a bit of practice to interpret, the effort is often worth it.
For instance, in this case, I could immediately tell by inspection that something bad was afoot in the inner loop of the UTF-8 decoder. A decoder more or less has to read a byte of input at a time, as this heavily edited bit of Core illustrates:
let x2 :: Word8
    x2 = case readWord8OffAddr# {- ... -} of
           (# s, x #) -> W8# x
in {- ... -}
What's important in the snippet above is that the value x2 is boxed, i.e. packaged up with a W8# constructor, so that it must be accessed via a pointer indirection. Since a decoder must read up to 4 bytes to emit a single Unicode code point, the loop was potentially boxing up 4 bytes, then immediately unboxing them:
case x2 of
  W8# x2# ->
    case x3 of
      W8# x3# ->
        case x4 of
          W8# x4# -> {- ... -}
While both boxing and unboxing are cheap in Haskell, they're certainly not free, and we surely don't want to be doing either in the inner loop of an oft-used function.
We can see why this was happening at line 96 of the decodeUtf8With function. I'd hoped that the compiler would be smart enough to unbox the values x1 through x4, but it turned out not to be.
Fixing this excessive boxing and unboxing wasn't hard at all, but it made the code uglier. The rewritten code had identical performance on pure ASCII data, but was about 1.7 times faster on data that was partly or entirely non-ASCII. Nice! Right?
Not quite content with this improvement, I tried writing a decoder based on Björn Höhrmann's work. My initial attempt looked promising; it was up to 2.5 times faster than my first improved Haskell decoder, but it fell behind on decoding ASCII, due to the extra overhead of maintaining the DFA state.
In English-speaking countries, ASCII is still the king of encodings. Even in non-English-speaking countries that use UTF-8, a whole lot of text is at least partly ASCII in nature. For instance, other European languages contain frequent runs of 7-bit-clean text. Even in languages where almost all code points need two or more bytes to be represented in UTF-8, data such as XML and HTML still contains numerous runs of ASCII text.
What would happen if we were to special-case ASCII? If we read a 32-bit chunk of data, mask it against 0x80808080, and get zero, we know that all four bytes must be ASCII, so we can just write them straight out without going through the DFA (see lines 82 through 110).
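Here's a rough sketch of what such a fast path can look like. This is an illustration rather than the package's exact code: the copy_ascii name and the dst buffer are invented, and the raw unaligned 32-bit load is the x86-specific part.

#include <stdint.h>

/* Sketch of an ASCII fast path: while a 32-bit load masked with
 * 0x80808080 is zero, no byte has its high bit set, so all four
 * bytes are ASCII and can be written straight out. */
static size_t copy_ascii(uint8_t *dst, const uint8_t *s,
                         const uint8_t *srcend)
{
    uint8_t *d = dst;
    while (s < srcend - 4) {     /* leave a tail for the slow path */
        uint32_t w = *((const uint32_t *) s);  /* unaligned load: fine on x86 */
        if (w & 0x80808080u)
            break;               /* a high bit is set: back to the DFA */
        *((uint32_t *) d) = w;   /* copy four ASCII bytes at once */
        d += 4;
        s += 4;
    }
    return (size_t)(d - dst);    /* bytes copied; the DFA handles the rest */
}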
As the numbers suggest, this makes a big difference to performance! Decoding pure ASCII becomes much faster, while both HTML and XML see respectable improvements. Of course, even this approach comes with a tradeoff: we lose a little performance when decoding entirely non-ASCII text.
Even in the slowest case, we can now decode upwards of 250MB of UTF-8 text per second, while for ASCII, we exceed 1.7GB per second!
These changes have made a big difference to decoding performance across the board: it is now always between 2 and 4 times faster than before.
As a final note, I haven't released the new code quite yet, so keep an eye out!
Cool. These numbers look great. I’m looking forward to its integration in Data.Text.
It might be worth it to check for alignment when doing the 32-bit loads. So, change line 92 to

if (state == UTF8_ACCEPT && ((uintptr_t) s & 3) == 0) {
Then you do at most 3 iterations through the slow path to get to a (hopefully) faster fast path, as in the sketch below. In addition, the code would also work on non-x86 platforms.
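As a rough sketch of that warm-up calculation (reusing the decoder's hypothetical s and srcend names):

#include <stddef.h>
#include <stdint.h>

/* How many bytes to feed through the byte-at-a-time path before s is
 * 4-byte aligned, capped by what's available; after consuming that
 * many bytes, every 32-bit load from s is aligned. */
static size_t bytes_until_aligned(const uint8_t *s, const uint8_t *srcend)
{
    size_t pad   = (size_t)(-(uintptr_t) s & 3);   /* 0..3 bytes */
    size_t avail = (size_t)(srcend - s);
    return pad < avail ? pad : avail;
}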
Also, shouldn’t line 93, "while (s < srcend - 4)", instead be "while (s <= srcend - 4)" or equivalently "while (s + 3 < srcend)"?
Twan, you raise a couple of subtle points.
Checking the alignment makes no difference to performance, at least on x86. I’d be interested to see measurements from someone who has access to a non-x86 system.
As for line 93, I had already tried the change you suggested, and it actually slows decoding down in every case. The reason is that we’d need another check for s == srcend after we exit from the fast-path loop, so that the byte-by-byte path won’t run past the end of the array if the fast path gets lucky and consumes the entire array. Since this extra check has to be performed on every iteration through the outer loop, we pay for it on both the fast and slow paths.
It’s thus an overall win for performance (at least with the code structured like this) to leave the last few bytes of the array to be consumed by the “slow path”.
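Concretely, the loop shape in question would look something like this sketch (names hypothetical; decode_byte is a toy stand-in for the real DFA step):

#include <stdint.h>

#define UTF8_ACCEPT 0

/* Toy stand-in for the real DFA step: accepts ASCII, flags the rest. */
static uint32_t decode_byte(uint32_t *state, uint8_t byte)
{
    *state = (byte < 0x80) ? UTF8_ACCEPT : 1;
    return *state;
}

static void decode_loop(const uint8_t *s, const uint8_t *srcend)
{
    uint32_t state = UTF8_ACCEPT;
    while (s < srcend) {
        if (state == UTF8_ACCEPT) {
            /* With "s <= srcend - 4", the fast path may consume the
             * entire array ... */
            while (s <= srcend - 4 &&
                   (*(const uint32_t *) s & 0x80808080u) == 0)
                s += 4;
            /* ... so this check would run on every outer iteration,
             * taxing the fast and slow paths alike. Stopping at
             * "s < srcend - 4" instead always leaves a tail for the
             * slow path, and the check becomes unnecessary. */
            if (s == srcend)
                break;
        }
        decode_byte(&state, *s++);   /* byte-at-a-time slow path */
    }
}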
Bryan,
Can you clarify the “C (portable)” and “C (x86)” series in your chart? It looked to me like you showed that Haskell was still a LOT slower than C. I assume they refer to the backend code generator used?
Justin, the “C (portable)” numbers are for the decoder that calls out to C without using the special ASCII fast path, while the “C (x86)” numbers are for the C decoder that has the ASCII fast path enabled. That fast path is x86/x86_64 specific for now, hence the label.
Don’t do the alignment check. On Nehalem and newer processors there’s absolutely no performance hit from unaligned memory accesses, except when you cross a cache-line boundary. Even then it’s faster than two loads.
For a while, metering seemed to show that our system was taking a long time to read strings from Oracle as compared to Java. I looked at our decoder, a standard library package. It was cleverly written to do all kinds of things fast, using powerful Lisp macros to generate good code. However, I expanded out our own special case and manually optimized it, and that did speed it up a lot. I think the gain was best for small strings, but I’m not sure.