I’ve spent some time over the past few weeks improving the performance of the attoparsec parsing library, and of the aeson JSON library. Since they’ve now reached a new plateau of performance and stability, I thought this would be a good time to release new versions.
The major advance in the new version of aeson is a considerable speed improvement.
The datasets I’m using are Twitter search results, from the Twitter JSON search API. For mostly-English results, 0.2.0.0 is up to 30% faster than before, while on Japanese data (which makes heavy use of Unicode escapes), I’ve bumped performance by more than 50%.
To see how well aeson performs compared to JSON parsers for other languages, I compared it against the json module in Python 2.7. That module’s JSON parser is written in C, so it’s very fast indeed, and the amount of actual Python being executed in my microbenchmark is tiny. How do we fare?
On mostly-English data, aeson is actually faster than Python’s native-code json parser. Nice! And on Japanese data, we’re a little slower, but still very competitive.
What if you’ve been using the Haskell json package, which was the first open source Haskell JSON parser to be published? Well, I do think that aeson is easier to use, but it’s also 3x faster than the json package:
The new version of aeson introduces some other useful improvements.
- There’s a new Generic module, which lets you convert almost any instance of the Data typeclass to and from JSON without writing boilerplate code. (Be warned: generics are slow. If performance is important to you, write that boilerplate!)
- We introduce a Number type that represents integers to full accuracy, and which handles floating point numbers efficiently.
- Instead of parsing via the Applicative typeclass, we now use a custom parsing monad, improving both ease of use and performance.
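A toy version of the mechanism behind the Generic module can be sketched with nothing but Data.Data from base. This is not aeson's actual code – the encode function and its crude string output are invented for illustration – but it shows how any Data instance can be traversed without boilerplate, and hints at why generics are slow:

```haskell
import Data.Data (Data, gmapQ, showConstr, toConstr)
import Data.List (intercalate)

-- Crude generic encoder in the spirit of Data.Aeson.Generic
-- (illustration only; the real module builds aeson Value terms, not
-- Strings).  Each value is rendered by asking its Data instance for
-- its constructor name and recursing into its fields.
encode :: Data a => a -> String
encode x = case gmapQ encode x of
  []     -> showConstr (toConstr x)   -- a leaf: Int, nullary constructor, ...
  fields -> "{\"" ++ showConstr (toConstr x)
            ++ "\":[" ++ intercalate "," fields ++ "]}"
```

For example, encode (Just (1 :: Int)) produces {"Just":["1"]}. Every step of such a traversal goes through runtime type dispatch, which is exactly why hand-written instances are so much faster.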
I wonder if you are really benchmarking Python’s JSON parser, or the allocation and instantiation of string/number/hash/array objects in Python.
The throughput numbers do not actually seem super impressive, so I suspect either that the C version tested via Python is not among the fastest ones, or that (as suggested) the Python overhead really is big. One cannot just assume the Python overhead is small based on the number of lines of code; it absolutely and positively must either (a) be profiled to prove it, or (b) the C version must be run directly, without the wrapper.
This is not to say the improvements are not nice, or that this wouldn’t be good for Haskell. Just that the C comparison part seems pretty much bogus.
Bryan,
Thank you for another great contribution.
Just to let you know – it seems that your Riak package doesn’t compile with this version of aeson.
Cheers,
Oz
I should really test this and tell you, but I have a strong suspicion that your generics module won’t work right for Map String & friends, because, essentially, dataCast1 doesn’t do what one would expect for partially specialized constructors of arity 2. See this Stack Overflow discussion: http://stackoverflow.com/q/4319982/371753
As of 0.3, syb has the `ext2` family of functions defined, so that’s one way to do it. Browsing around, I also note that ekmett has uploaded syb-extras (http://hackage.haskell.org/package/syb-extras-0.2.0). I’m not exactly sure what this does yet, but it looks interesting 🙂
In this blog post, you focus on the improvements to aeson, which is great. Thanks for the nice work and a nice post. But you mention that there were also some changes and improvements to attoparsec. Can you give us a little more detail about that?
See some more specific questions on the reddit thread.
Thanks,
Yitz
I thought I posted something before on this, but it doesn’t seem to have shown up. In any case, I just confirmed that even though you have code in the Generic module for handling Map String a, Map Text a, etc. it doesn’t work…
Prelude Data.Aeson Data.Aeson.Generic Data.Map> Data.Aeson.Generic.toJSON $ (1, Data.Map.fromList [("foo", Number 1), ("bar", Number 12)])
Array fromList [Number 1,*** Exception: Data.Aeson.Generic.toJSON: not AlgRep NoRep(DataType {tycon = "Data.Map.Map", datarep = NoRep})
There’s a subtle gotcha involved. The ext1 family doesn’t work on data constructors of arity 2 which are partially specialized. The reason this is confusing is that gcast *does* work on such constructors, while dataCast doesn’t.
You need to use the ext2 family (new to the latest release of syb) and some ugly typecasing to get the effect you want.
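The failure mode described here can be reproduced with dataCast1 alone, no syb required. In this sketch, the Q wrapper and the seesAsList helper are invented for illustration:

```haskell
{-# LANGUAGE RankNTypes, ScopedTypeVariables #-}
import Data.Data (Data, dataCast1)
import qualified Data.Map as Map
import Data.Maybe (isJust)

newtype Q r a = Q (a -> r)

-- Ask a type's Data instance whether dataCast1 can view it as an
-- application of an arity-1 constructor (here []).  The list instance
-- defines dataCast1, so [Int] succeeds; Map's instance (arity 2) only
-- defines dataCast2, so Map String Int fails – even though plain gcast
-- would cast a fully applied Map String Int just fine.
seesAsList :: forall a. Data a => a -> Bool
seesAsList _ = isJust (dataCast1 q :: Maybe (Q Int a))
  where
    q :: Data d => Q Int [d]
    q = Q length
```

Here seesAsList [1, 2, 3 :: Int] is True, while seesAsList on any Map is False. Since ext1 is built on dataCast1, a generic extension registered with it silently misses Map String a, and the traversal falls through to the default case.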
@sclv, any chance you might send a patch in, please?
I’d like to know how the UTF-8 thing slowed down the current implementation, since Twitter stores Japanese characters in UTF-8. I am quite new to Haskell, but I am also a non-English speaker who needs to work in Unicode for my primary language (Chinese – and Chinese has two flavors, too).
As Python 3 is very Unicode friendly (it actually defaults to Unicode strings), can Haskell in any way get the same performance for Unicode ByteStrings, or is it, alternatively, proved slower than Python by this test?
@itsnotvalid: It’s not UTF-8 that slows down parsing, it’s the handling of escaped Unicode characters of the form \u1234 (and sometimes \u1234\u5678). The handling of unescaped UTF-8 is plenty fast – that’s why we see good performance on English data.
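For the curious, the expensive case is a UTF-16 surrogate pair: two \uXXXX escapes that jointly encode one code point beyond the Basic Multilingual Plane. The combining arithmetic itself is simple (fromSurrogates is my name for it, not aeson’s):

```haskell
import Data.Char (chr)

-- Combine a UTF-16 surrogate pair – the two halves of an escape such
-- as \uD842\uDFB7 – into the single code point it represents.
fromSurrogates :: Int -> Int -> Char
fromSurrogates hi lo =
  chr (0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00))
```

For example, fromSurrogates 0xD842 0xDFB7 gives U+20BB7. The cost in a parser comes not from this arithmetic, but from detecting the escapes and splicing the decoded characters back into the surrounding text.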
Not bad at all, fellas and gals. Thanks.