I'm pleased to announce the availability of criterion, a new library for measuring the performance of Haskell code.
Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.
Here's a canonical benchmark-worthy function, which has the desirable properties of being both small and slow:
-- Save this as a file named "Fibber.hs".
import Criterion.Main (defaultMain, bench)
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
To compile a program that will benchmark our fib function, we write a trivial main entry point:
-- Append this to the end of "Fibber.hs".
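-- The odd-looking (10+n-n) below makes each result depend on the lambda's
-- argument, so GHC can't just compute "fib 10" once and share the answer
-- across every timed iteration.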
main = defaultMain [
         bench "fib 10" $ \n -> fib (10+n-n)
       , bench "fib 30" $ \n -> fib (30+n-n)
       , bench "fib 35" $ \n -> fib (35+n-n)
       ]
Criterion itself is trivially easy to install.
$ cabal install criterion
Once it's installed, we build the application we intend to benchmark.
$ ghc -O --make Fibber
If we run the Fibber program with the --help option, it prints a large amount of help output. The most interesting options early on are -t, which displays raw timing information, and -k, which turns the timing information into a chart of the probability distributions of timings. So for immediate gratification, we can run Fibber like this:
$ ./Fibber -t win -k win
The win argument to each option means "display these charts in a window". For this article, I used the png argument instead, to get nicely sized images that I could simply drop into WordPress.
$ ./Fibber -t png:450x175 -k png:450x175
Here's an example of raw timing information:
Notice that the times on the Y axis are labelled with their units, in this case microseconds. On the X axis is the run count.
Timing information is somewhat nice to have, but much more useful is an indication of which times occurred most frequently. The traditional way to display such information is with a histogram. Histograms are pretty awful, since a histogram with an incorrect bin size is either misleading or useless, and bin sizes have to be chosen by hand. Much more useful is a kernel density estimate:
This is a chart of the estimated probability densities of timings. We observe a big hump at about 5.23µs, with a few small outliers further out. This suggests that it usually takes about 5.23µs to evaluate fib 10.
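If you haven't run into kernel density estimation before, the idea is simple: centre a little bump (the kernel) on every measurement and add the bumps up, so regions with many measurements get high density. Here is a toy Gaussian-kernel sketch of the idea, with a hand-picked bandwidth; it is not criterion's estimator, just the underlying technique:

-- Toy kernel density estimate, for illustration only.
gaussian :: Double -> Double
gaussian u = exp (-0.5 * u * u) / sqrt (2 * pi)

-- Estimated density at point x, given measurements xs and bandwidth h.
kde :: Double -> [Double] -> Double -> Double
kde h xs x = sum [ gaussian ((x - xi) / h) | xi <- xs ]
             / (fromIntegral (length xs) * h)

-- e.g. map (kde 1.0e-7 timings) [5.0e-6, 5.05e-6 .. 6.0e-6]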
Meanwhile, in our console window, our Fibber program has printed a lot more information. Here's how it starts:
estimating clock resolution...
mean is 9.956511 us (80001 iterations)
found 3711 outliers among 79999 samples (4.6%)
3505 (4.4%) high severe
The first thing that Fibber does is measure how long it takes for the system clock to tick. It will then run our benchmarked code enough times that the resolution of the clock will not introduce a significant error.
Indeed, our first benchmark shows why this is important: if fib 10 takes 5.23µs to evaluate, but the clock ticks once every 9.96µs, we clearly can't just evaluate fib 10 once and subtract the start time from the end time. Sometimes it will appear to evaluate instantaneously, since the clock won't have ticked, and sometimes it will appear to have taken about 10µs, since the clock will have ticked exactly once. Both answers would be worse than useless.
Instead, we evaluate fib 10 many thousands of times in order to get the measurement error introduced by the resolution of the clock down to a few parts in ten thousand.
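To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of that reasoning; the names and the exact error target are mine, and criterion's actual heuristic differs in detail:

-- Back-of-the-envelope only, not criterion's code.
clockResolution, targetRelativeError, minInterval :: Double
clockResolution     = 9.96e-6   -- seconds per clock tick, from the output above
targetRelativeError = 5.0e-4    -- "a few parts in ten thousand"

-- One tick of error stays below the target if the measured interval is long enough.
minInterval = clockResolution / targetRelativeError    -- about 0.02 s

-- At roughly 5.23us per call, that means thousands of evaluations of fib 10.
minIterations :: Int
minIterations = ceiling (minInterval / 5.23e-6)         -- about 3800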
We also use something called the boxplot technique to develop a quick sense of the quality of our data. In this case, a few percent of our numbers are significantly off from the mean of the sample.
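The "high mild" and "high severe" labels in the output look like the classic boxplot fences: a point more than 1.5 interquartile ranges beyond a quartile is a mild outlier, and more than 3 IQRs is severe. Here is a minimal sketch of that rule, using crude sorted-list quartiles; criterion's actual classification may differ in the details:

import Data.List (sort)

data Outlier = LowSevere | LowMild | Normal | HighMild | HighSevere
  deriving (Eq, Show)

-- Crude quartiles: index into the sorted sample.  Good enough for a sketch.
quartiles :: [Double] -> (Double, Double)
quartiles xs = (sorted !! (n `div` 4), sorted !! (3 * n `div` 4))
  where sorted = sort xs
        n      = length xs

-- The classic boxplot fences: mild outliers lie beyond 1.5 IQRs from the
-- quartiles, severe outliers beyond 3 IQRs.
classify :: [Double] -> Double -> Outlier
classify xs x
  | x > q3 + 3.0 * iqr = HighSevere
  | x > q3 + 1.5 * iqr = HighMild
  | x < q1 - 3.0 * iqr = LowSevere
  | x < q1 - 1.5 * iqr = LowMild
  | otherwise          = Normal
  where (q1, q3) = quartiles xs
        iqr      = q3 - q1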
Once we've characterised the system clock's period, we figure out how expensive it is to actually use the clock.
estimating cost of a clock call...
mean is 923.3224 ns (58 iterations)
found 4 outliers among 58 samples (6.9%)
1 (1.7%) high mild
3 (5.2%) high severe
We adjust our timing measurements to take the cost of clock calls into account, even though in practice this makes almost no difference to the numbers we report. (If you're measuring an event that occurs on a millisecond-to-second time scale, you wouldn't expect a one-microsecond difference to perturb things much!)
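As a sketch of what such an adjustment might look like (this is my guess at the shape of the correction, not criterion's actual code), one could subtract the cost of the bracketing clock calls before dividing by the iteration count:

-- Hypothetical correction, for illustration only.
clockCallCost :: Double
clockCallCost = 923.3224e-9   -- seconds, from the output above

-- A timed run is bracketed by two clock reads; remove their estimated cost.
perIterationTime :: Double -> Int -> Double
perIterationTime rawElapsed iters =
  (rawElapsed - 2 * clockCallCost) / fromIntegral iters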
The last part is the most interesting. We automatically figure out how many times we need to evaluate fib 10:
benchmarking fib 10
collecting 100 samples, 1911 iterations each, in estimated 995.8188 ms
In this case, we evaluate fib 10 1911 times and measure the cost of this, then repeat this measurement 100 times. We print an estimate in advance for how long the measurements will take, so that if you're benchmarking something expensive, you can go for coffee or whatever. (If you can fetch coffee in 995.8 milliseconds, more power to you!)
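The printed estimate is just arithmetic: samples times iterations times the per-evaluation time measured so far. A trivial check, where estimatedRunTime is a made-up helper rather than part of criterion:

-- 100 samples of 1911 iterations at about 5.21us per evaluation is roughly 0.996 s.
estimatedRunTime :: Int -> Int -> Double -> Double
estimatedRunTime samples iters perEval =
  fromIntegral samples * fromIntegral iters * perEval

-- estimatedRunTime 100 1911 5.21e-6  is roughly  0.996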
Why do we measure so many times? Because we're going to do some statistical analysis of our measurements to see whether they are trustworthy.
bootstrapping with 100000 resamples
The bootstrap step takes a second or two, and this is where the interesting stuff takes place. The performance of real-world applications doesn't follow some kind of tidy statistical pattern such as the normal distribution, especially when people are measuring in realistic, noisy environments.
If you run a time-consuming benchmark on your laptop, you're likely to want to switch to your web browser, watch a video on youtube, check your mail, and so on. Your CPU might slow down because it's overheating, or your laptop's ACPI subsystem might change the CPU frequency to conserve power.
In other words, there are lots of things that might perturb our benchmarking results, some of which we can't see or control, but then again they might not occur or might not be important, and our numbers could be fine. How can we tell?
We use a statistical technique known as the bootstrap, with which we take some sample data and perform some number crunching to tell us interesting things. Specifically, we can use the results of the bootstrap to tell us whether the outliers in our measurements (timings that are far from the mean) are perturbing our numbers in a significant way.
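If you haven't met the bootstrap before, the core recipe is short: resample the measurements with replacement, recompute the statistic of interest on each resample, and look at how the recomputed values spread out. Here is a minimal percentile-interval sketch for the mean, using System.Random; criterion's real analysis computes its intervals considerably more carefully than this:

import Data.List (sort)
import System.Random (randomRIO)

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- One resample: the same number of points, drawn with replacement.
resample :: [Double] -> IO [Double]
resample xs = mapM (const pick) xs
  where pick = do i <- randomRIO (0, length xs - 1)
                  return (xs !! i)

-- A crude percentile bootstrap: resample n times and read the 2.5th and
-- 97.5th percentiles of the resampled means off the sorted list.
bootstrapMeanCI :: Int -> [Double] -> IO (Double, Double)
bootstrapMeanCI n xs = do
  means <- mapM (const (fmap mean (resample xs))) [1 .. n]
  let sorted = sort means
      at p   = sorted !! min (n - 1) (floor (p * fromIntegral n))
  return (at 0.025, at 0.975)

With the 100 measurements above and 100,000 resamples, this is the kind of number crunching that the "bootstrapping with 100000 resamples" line refers to, though resampling a list this naively would be far too slow in practice.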
The first thing that the bootstrap tells us is the mean and standard deviation of our measurements, along with the 95% confidence intervals for those values.
mean: 5.232800 us, lb 5.222862 us, ub 5.262482 us, ci 0.950
std dev: 79.86726 ns, lb 31.06332 ns, ub 173.0716 ns, ci 0.950
It then reports on the outliers in the measurements, and most importantly, tells us whether they are important.
found 7 outliers among 100 samples (7.0%)
3 (3.0%) high mild
4 (4.0%) high severe
variance introduced by outliers: 0.993%
variance is unaffected by outliers
I ran the above example on my laptop when it was idle, so the timing measurements are fine. Boring!
What about a more interesting example? I measured the performance of fib 35:
collecting 100 samples, 1 iterations each, in estimated 88.14368 s
While this executed, I ran a stupid but computationally expensive task in another window:
while true; do
    for ((i=0;i<20;i++)); do
        md5sum cufp.odp >/dev/null &
    done
    for ((i=0;i<20;i++)); do
        wait
    done
done
This shell script grinds over a largish file repeatedly, but does so using 20 processes at a time. It reliably drives the load average on my laptop up to 8. What did it do to my measurements?
As the timing numbers suggest, I ran my CPU-chomping script twice while taking measurements. This had a noticeable effect not just on the timings, but on their probability distributions too:
Those are two fairly large humps, which certainly seem like they should give cause for concern. What does Fibber report on the console?
mean: 1.566470 s, lb 1.333717 s, ub 1.855111 s, ci 0.950
std dev: 1.324608 s, lb 1.114684 s, ub 1.510580 s, ci 0.950
found 23 outliers among 100 samples (23.0%)
22 (22.0%) high severe
variance introduced by outliers: 52.994%
variance is severely inflated by outliers
Notice that the standard deviation of the sample is almost the same as the mean, a classic sign of sketchy performance measurements. However, we don't merely have to spot this suspicious pattern ourselves: by analysing the results of the bootstrap, our benchmark automatically tells us that our numbers are, in effect, junk.
So! Thanks to criterion, you can now easily measure the performance of Haskell code, and even better, you'll (usually) be told when you're lying to yourself. Oh, and you can produce those charts in PDF, SVG, and CSV formats, too, in addition to PNG and windowed output, so you can simply drop camera-ready copy straight into your next Haskell Implementors' Workshop submission!
There's plenty of work yet to be done. In order to improve the rigour of the bootstrap analysis, I want to perform an autocorrelation test on the timing measurements, since an autocorrelated sample can distort the result of a bootstrap analysis. I'd also like to be able to produce charts with many estimates overlaid, to allow things like scaling and comparative analyses. But I think that criterion is off to a solid start. If you'd like to work on any of these, let me know: I think that the code is quite clean and easy to work on, and of course the darcs repository for the source is easy to get:
darcs get http://darcs.serpentine.com/criterion
It should even be easy to use criterion to benchmark C code and command line programs.
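For instance, here is a sketch of benchmarking an external command. It assumes a criterion recent enough to export whnfIO from Criterion.Main, plus readProcess from the process package, and of course it measures process startup along with whatever work the program does:

-- A sketch only: timing an external command via criterion.
import Criterion.Main (bench, defaultMain, whnfIO)
import System.Process (readProcess)

main :: IO ()
main = defaultMain
  [ bench "ls /tmp" $ whnfIO (readProcess "ls" ["/tmp"] "")
  ]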
Much of my work on criterion was inspired by Brent Boyer's excellent articles on Java performance analysis (part one and part two). Amusingly, Java and Haskell each pose substantial benchmarking challenges: in the case of Java, dynamic recompilation makes getting good numbers very challenging, while for Haskell, lazy evaluation brings its own difficulties.
Cool stuff! What happens if I have a process running in the background that adds a consistent CPU load rather than spikes? Can that still be detected?
Johan, it would be impossible to notice a consistent source of unvarying load, short of writing platform-specific hacks to see what “ps” says or the like.
Looks pretty exciting! I hope this means we can kill the maybench project.
Meanwhile, I’ve filed a bug on the Darcs tracker about possibly integrating this into our darcs-benchmark package:
http://bugs.darcs.net/issue1631
Thanks
Hi Bryan,
This is awesome; thanks!
Here’s a newbie question: I tried to install it but it complains about gtk missing (and cairo too). ‘cabal install gtk’ however fails with ‘no package named gtk’. (I’m running Leopard.)
Can you perhaps add some installation notes that says how to get (custom) required packages or what machines criterion is supposed to run on?
Dear Martijn,
gtk is not a Haskell package, but rather, as I understand it, a placeholder for a system package.
The best way is to install MacPorts and do “sudo port install gtk2hs”. Everything will be installed after that.
Martijn, I tried to get Criterion working on Windows and I had to install Gtk2Hs (http://www.haskell.org/gtk2hs/) and run cabal install under MSYS/MinGW before it actually worked. I noticed that Gtk2Hs is also available in MacPorts, so you might like to try installing using that if you are already using MacPorts.
Also, even though it works under Windows, I seem to be getting some very strange results using the sample code from the post, like:
found 79959 outliers among 39999 samples (199.9%)
I have a question related to the lambdas being benchmarked: \n -> fib (10+n-n). Why the addition and subtraction of n? Is this to prevent GHC from optimizing it? Does GHC optimize (\n -> fib 10) to fib 10?
Why the dependency on GTK at all? It’d be nice to use Criterion to benchmark DPH code, but I certainly don’t want to depend on GTK.
Hi Bryan,
Great job and thanks for this library which seems really cool!
I went to Hackage and the docs aren't built, which is a bit annoying. Is there a way to fix this?
Very true. Having sensible empirical data to guide data structure use is invaluable. I wonder if collections of these graphs should work their way into hackage somewhere?
Manuel:
Criterion depends on gtk2hs in order to draw its charts via the cairo graphics library.
Things would be much improved if gtk2hs could be installed via cabal or, ideally, became part of the Haskell Platform.
I notice that the upcoming tool for parallel code analysis, ThreadScope, also relies on gtk2hs.
Awesome work!
Any plans to also benchmark memory usage?
Confidence Intervals v. Bayesian Intervals
Derek, I have Jaynes’s monster Bayesian textbook at home
Thank you for the reminder to add the (x+n-n) ==> x special case to GHC's optimizer.
Nice work, Bryan!
The first example given does not compile (did no one else even try it out?)
$ ghc -O --make Fibber
[1 of 1] Compiling Main ( Fibber.hs, Fibber.o )
Fibber.hs:12:9: Not in scope: `bench’
Fibber.hs:13:9: Not in scope: `bench’
Fibber.hs:14:9: Not in scope: `bench’
I modified the first line to
import Criterion.Main (defaultMain, bench)
which keeps the compiler happy.
The last line says “lazy evaluation brings its own difficulties”. Does Criterion address the problem of lazy evaluation? And if so how?
This is truly a pearl in the Haskell world!
Yes, what was that fib(10+n-n) business all about?
0.1.2 works on OS X (minus the graphical output). Sweet!
Could criterion perhaps be split into a package that doesn’t depend on gtk2hs and one that does? Installing gtk2hs is quite inconvenient, particularly on Windows.
Ganesh, the new release can be built sans gtk2hs/cairo dependency, via “cabal install -f-chart”.
Bryan: oh, thanks. I had read your blog post about using it on the Mac but hadn't thought about how -f-chart would solve the previous problems.
Thanks, Bryan! This is just what I was searching for. Installation as usual, one blogpost to understand it and all the nice features one could want.
Your library fills a niche for the run-of-the-mill developer.
Bryan, thanks for the library. I’ve used it to benchmark some scripts and it improved my understanding of the results, otherwise I wouldn’t have bothered to do statistical processing on them.
Hi, and thanks for your post. It has a lot of useful info.
I installed the criterion package, libpng, etc. on Ubuntu 9.10, which has GHC 6.12.
Unfortunately, when I run the command it gives me errors and says it can't plot the PNG output.
./profile -t png:500x175 -k png:500x175
output type PNG 500 175 not supported on this platform
output type PNG 500 175 not supported on this platform
Why is it so? What else do I need to be able to actually see the graphical result?
I read the comments above and got the feeling that I have to install gtk2hs, yeah? Actually, I installed gtk2hs-0.10.1, but it seems that it can't be installed completely.
For example, I get this error after I run make install:
ghc: can’t find a package database at package.conf.inplace
make: *** [glib/System/Glib.o] Error 1
Is there any way to get this fixed? Do I really have to install gtk2hs? It seems to have a bug that prevents it from being installed on GHC 6.12. Could you please help me with this?
Thanks,
"Why is it so? What else do I need to be able to actually see the graphical result?"
If plotting does not work for you, you can always generate the csv file via -t csv on the command line when running the criterionised application, and then use e.g., R to generate plots in any desired format.
For example,
library(stats);
criterion_output_values <- read.csv("output",header=TRUE,sep=",")[,2];
pdf("output.pdf",paper="special",height=6,width=6,pointsize=10);
plot(density(criterion_output_values), main="MyBenchmark criterion times", xlab="Execution time", ylab="density");
dev.off();
q();
I'd like to draw your (i.e., the commenters'/readers') attention to the following workshop: http://evaluate2010.inf.usi.ch/. Its purpose is to gather people concerned about the state of experimental computer science and the way in which we benchmark systems, languages, implementations, etc., because things are not as nice as we often like to think.
The “-fchart” flag to cabal install helps with runtime errors like this one:
ERROR: output type Window 800 600 not supported on this platform
If you encounter problems like “ERROR: output type Window 800 600 not supported on this platform”, you can try the following (I got it working on my Linux machine with GHC 7.0.2):
1)
cabal update && cabal install gtk2hs-buildtools cairo gtk
2)
cabal install -fChart -fgtk
(or change both flags in the .cabal-file to True if they are defaulted to False)
3)
./Fibber -t win -k win
or
./Fibber -t png:450x175 -k png:450x175
Then the images will be generated properly.
Thanks. That’s a good read on Criterion, a new benchmarking library.