Every functional programmer worth their salt seems to end up with at least a few stories to tell about programming in CPS, also known as continuation-passing style. Here’s my latest one.
As a user, you can’t tell that my attoparsec parsing library is implemented almost entirely using explicit continuations. Every combinator accepts two continuations:
the failure continuation is invoked if the current function fails
the success continuation is invoked if the current function succeeds
Making matters more complex, each continuation accepts three additional parameters (see the sketch after this list):
the input currently known to be available
any additional input that was received when parsing was suspended due to insufficient input (to support backtracking in the case of a failed parse)
an end-of-input marker, to record whether our caller has no more input to feed us incrementally
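Here’s a sketch of the shape this takes in code. It’s a simplification for illustration (the names and details here are mine, and the real library also supports resumable parses via a Partial result), not attoparsec’s exact internals:

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.ByteString (ByteString)

-- Has our caller promised more input, or is this everything?
data More = Complete | Incomplete

-- A parse ultimately fails or succeeds.  (The real library also
-- has a Partial constructor so that parses can be resumed.)
data Result r = Fail ByteString String
              | Done ByteString r

-- Both continuations receive the three pieces of state listed
-- above: the input we know about, input that arrived while we
-- were suspended, and the end-of-input marker.
type Failure   r = ByteString -> ByteString -> More -> String -> Result r
type Success a r = ByteString -> ByteString -> More -> a      -> Result r

newtype Parser a = Parser {
      runParser :: forall r.
                   ByteString     -- input currently available
                -> ByteString     -- input added while suspended
                -> More           -- end-of-input marker
                -> Failure r      -- what to do on failure
                -> Success a r    -- what to do on success
                -> Result r
    }
```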
I originally implemented attoparsec as a very traditional state transformer monad. Every bind action would check whether the previous action had succeeded: if so, it would invoke the next action; otherwise, it would pass the failure forwards. While this worked, its performance was disappointing: my parsers didn’t run any faster than those built on the venerable parsec package!
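For contrast, here’s roughly what that original design looked like; again, a simplified sketch with made-up names rather than the actual code:

```haskell
import Control.Monad (ap, liftM)
import Data.ByteString (ByteString)

-- A state transformer carrying the remaining input; every bind
-- inspects the result of the previous action.
newtype ParserST a = ParserST {
      runParserST :: ByteString -> (Either String a, ByteString)
    }

instance Functor ParserST where
  fmap = liftM

instance Applicative ParserST where
  pure a = ParserST $ \s -> (Right a, s)
  (<*>)  = ap

instance Monad ParserST where
  m >>= k = ParserST $ \s ->
    case runParserST m s of
      (Right a,  s') -> runParserST (k a) s'  -- success: run the next action
      (Left err, s') -> (Left err, s')        -- failure: pass it forwards
```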
The work of switching from a traditional state transformer over to CPS took just a couple of days. I’ve had my bacon repeatedly saved by Haskell’s type system, which ensures that I’m not passing the wrong kind of continuation in the wrong slot, or mixing up input-I’m-going-to-consume with input-I’ve-been-fed.
Switching over to CPS bought a factor-of-eight performance improvement, which delighted me. The code remains fairly easy to follow, because I was careful to write it cleanly (of all the places where excessive cleverness can bite you, working with CPS must come close to the top).
Looking at the Core generated by GHC, though, suggests that there’s plenty of room for improvement: it allocates tons of closures. And sure enough, they show up at runtime.
[Heap profile chart: runtime allocation dominated by items marked THUNK.]
See those items marked THUNK in the chart above? Yeah, ouch.
That 8x performance improvement is real, but I actually managed to improve on it further for the aeson JSON library. In aeson, careful profiling indicated that dealing with text in general, and Unicode escapes in particular, was a serious bottleneck. Having run out of obvious ways to speed the code up in its existing form, I wrote a tiny, highly focused module, Data.Attoparsec.Zepto, that forgoes CPS in favour of the state transformer approach. For non-recursive parsers that shouldn’t fail (e.g. dealing with escaped text), this module performs very well: it bought me a 35% performance boost in JSON parsing when I introduced it! Unfortunately, that approach only seems to work well in this one isolated instance: state transformers continue to kill performance if I try to use them more widely within attoparsec or aeson.
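For a flavour of the kind of job Zepto suits, here’s a tiny illustrative parser built from the module’s public combinators. This example is mine, not aeson’s actual escape-handling code:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Attoparsec.Zepto as Z
import Data.ByteString (ByteString)

-- Recognise a JSON-style "\uXXXX" escape and return the four hex
-- digits.  Purely illustrative; aeson's real escape handling does
-- rather more than this.
hexEscape :: Z.Parser ByteString
hexEscape = do
  Z.string "\\u"   -- the escape marker
  Z.take 4         -- the four hex digits

-- ghci> Z.parse hexEscape "\\u0041"
-- Right "0041"
```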
In the short to medium term, then, CPS is usually a win, but we can do better: I’m hoping to help the Simons ferret out whatever’s causing continuations to be compiled into closures instead of straightline code. GHC 7 currently contains some pretty big performance regressions for CPS-heavy code, but they’ve been partly ameliorated by a recent patch (regression dropped from 70% to 22%). I think we’ve plenty of scope to make performance substantially better just via compiler improvements.
I’m going to venture that there is a good chance the 8x performance improvement did not really come from CPS; it might have been from an incidental change.
I say this because a number of people have noted that applying ContT to Control.Monad.State results in a very dramatic performance improvement. This isn’t because CPS is “faster”, but rather because the continuation transformer turns lazy state monads into strict state monads. An apples-to-apples performance comparison is much less clear.
In an early experiment with the Snap Framework, I rewrote the Snap monad to use continuations instead of a strict state monad. According to the benchmarks we’ve used, it resulted in slightly worse performance. (I’m not convinced that our benchmarks are terribly meaningful, but that’s a separate issue.)
I know that Parsec 3 also got a significant performance boost by changing to CPS, and that this has been observed elsewhere. In the end, I wish I had a better understanding or intuition about the performance tradeoffs between CPS and a strict state monad.
Leon, I know exactly where the performance difference came from, because I measured carefully at every step, and experimented with both lazy and strict state in each instance. That is, after all, my thing.
This ticket sheds some light on why CPS is sometimes slow:
http://hackage.haskell.org/trac/ghc/ticket/4978
Bryan, I’m not impugning the superb quality of your work. =) I am pointing out that a careful experimental approach doesn’t explain why something is faster, even though it is a wonderfully pragmatic way of improving performance.
Maybe you really truly do understand why CPS made attoparsec faster, but I don’t, and this post doesn’t even attempt to explain that. We both know that GHC is a complicated beast; this would be much easier for me if we were discussing the performance of some Scheme programs relative to some Scheme compiler.
So I have this nagging suspicion that the real reason why CPS improved Attoparsec is some kind of Heisenberg-like incidental change, maybe to the order of evaluation. This hunch is by analogy with my past experiences. I’m not saying the details are the same; I’ve read enough of Attoparsec to know that isn’t true. But certainly I’m not satisfied with the explanation.
One reason why CPS improves performance is that I can remove the need for pattern matching and constructor allocation.
Compare the algebraic data type and CPS versions of e.g. bind in the Maybe monad. The CPS version never allocates Nothing and Just constructors and never needs to pattern match on these constructors.
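To make that concrete, here’s a minimal sketch of the two versions:

```haskell
{-# LANGUAGE RankNTypes #-}

-- ADT version: bind pattern matches on the constructors, and k's
-- result is itself an allocated Just or Nothing.
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b
bindMaybe m k = case m of
  Nothing -> Nothing
  Just a  -> k a

-- CPS (Church-encoded) version: a value is a function awaiting a
-- "nothing" continuation and a "just" continuation, so bind is
-- pure control flow with no constructors to build or match on.
newtype MaybeCPS a = MaybeCPS
  { runMaybeCPS :: forall r. r -> (a -> r) -> r }

bindCPS :: MaybeCPS a -> (a -> MaybeCPS b) -> MaybeCPS b
bindCPS m k = MaybeCPS $ \nothing just ->
  runMaybeCPS m nothing (\a -> runMaybeCPS (k a) nothing just)
```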
In theory GHC could convert the data flow (the ADT version) into control flow (the CPS version) but doesn’t always seem to. Doing so likely relies on heavy inlining and case-of-case transformations.
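Here’s case-of-case in miniature, as a hand-worked illustration rather than actual GHC output:

```haskell
-- Before: the inner expression allocates a Maybe that the outer
-- case immediately takes apart.
before :: Bool -> Int -> Int
before b x = case (if b then Just x else Nothing) of
  Just y  -> y + 1
  Nothing -> 0

-- After case-of-case (plus inlining): the match moves into the
-- branches, and no Just or Nothing is ever constructed.
after :: Bool -> Int -> Int
after b x = if b then x + 1 else 0
```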
Tibbe, to an extent, I understand the general implementation tradeoffs with control versus data. However, I would be somewhat surprised if the general tradeoffs alone explained the 8x increase in performance.
Might this represent an implementation weakness in GHC? Since everything in Haskell potentially creates a closure, it’s important for GHC to aggressively minimize the size of closures, and to eliminate them where it can. Maybe this (partially) explains why GHC tends to give good performance on CPS code, whereas data-heavy code runs into some kind of weakness as well.
Why do some monads respond very positively to CPS while others don’t, e.g. Attoparsec versus Snap? Is it because Attoparsec backtracks more often? Is it because Attoparsec has a better performance test suite, and Snap’s didn’t manage to highlight the difference?
Leon,
I think Snap doesn’t benefit much from CPS because there’s a lot of work being done between each continuation invocation. In attoparsec, there’s very little work between continuation invocations. Put another way, attoparsec has a lot of small functions (e.g. pull a single byte off the input stream and apply the continuation), while Snap uses comparatively bigger ones.
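For instance, a single-byte primitive in the style of the Parser sketch from the post might look like this (illustrative, not attoparsec’s actual code, and reusing the types defined in that sketch):

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Reusing the Parser/Failure/Success sketch from the post above:
-- grab one byte and hand it straight to the success continuation,
-- or invoke the failure continuation if the input is empty.  (The
-- real primitive also knows how to suspend and ask for more input.)
anyWord8 :: Parser Word8
anyWord8 = Parser $ \inp added more lose win ->
  case B.uncons inp of
    Just (w, rest) -> win rest added more w
    Nothing        -> lose inp added more "anyWord8: insufficient input"
```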