Tim Bray has recently been writing about a simple log file processing task, giving his efforts the (decidedly peculiar) name of Wide Finder. The task at hand is to count popular links in an Apache log file.
Here’s my two minutes’ worth of hat in the ring, in Haskell.
> main = do
>   args <- getArgs
>   forM_ args $ \name -> do
>     m <- (foldl' count M.empty . LB.lines) `fmap` LB.readFile name
>     mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` g y
>         count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of
>                          ((_:g:_):_) -> M.insertWith' (+) g 1 m
>                          _ -> m :: M.Map LB.ByteString Int
Here’s some comparable Python:
> import re, sys
> pat = re.compile(r'.*"GET /en/([^ ]+\.html)')
> for name in sys.argv[1:]:
>     d = {}
>     for line in open(name):
>         m = pat.match(line)
>         if m:
>             d[m.group(1)] = d.setdefault(m.group(1), 0) + 1
>     for i in sorted(d.items(), key=lambda x:x[1], reverse=True)[:10]:
>         print i
The Haskell code can chew through 3.2 million records in 10.5 seconds on my laptop, while the Python takes 11.6 seconds.
Both of these programs spend about 90% of their time in regexp matching code, which makes this just a regexp engine and I/O benchmark. Yawn! Can we squeeze a little bit of entertainment out of the problem?
This is a trivially parallelisable problem: there are no data dependencies between different parts of a log file, so we can process them however we please.
Let’s split the input file into chunks of approximately equal size, aligned to line boundaries. This is dead easy to do with almost no I/O: just seek to the nearest chunk boundary, and read a little until we hit a newline.
> chunkedLineBoundaries :: Int -> FilePath -> IO [(Int64, Int64)]
> chunkedLineBoundaries numChunks path = do
>   totalSize <- (fromIntegral . fileSize) `fmap` getFileStatus path
>   let chunkSize = totalSize `div` fromIntegral numChunks
>   bracket (openFile path ReadMode) hClose $ \h ->
>     flip fix 0 $ \findOffsets offset -> do
>       let newOffset = offset + chunkSize
>       hSeek h AbsoluteSeek (fromIntegral newOffset)
>       flip fix newOffset $ \loop off -> do
>         eof <- hIsEOF h
>         if eof
>           then return [(offset, totalSize - offset)]
>           else do
>             bytes <- LB.hGet h 4096
>             case LB.elemIndex '\n' bytes of
>               Just n -> do
>                 offsets <- findOffsets (off + n + 1)
>                 return ((offset, fst (head offsets) - offset) : offsets)
>               Nothing -> loop (off + LB.length bytes)
The chunkedLineBoundaries function returns a list of (offset, length) pairs. We’ll use this to fire off multiple threads, each of which will consume a single chunk of the file in parallel.
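Before wiring the chunks up to threads, here’s a quick sanity check of those pairs (just a sketch for testing, not something the program needs; Int64 comes from Data.Int, which the signatures above already assume): consecutive chunks should tile the whole file, each one starting where its predecessor ended.

> -- Sanity check (illustrative only): the (offset, length) pairs
> -- returned by chunkedLineBoundaries should cover the file exactly,
> -- with each chunk beginning where the previous one ended.
> chunksCoverFile :: Int64 -> [(Int64, Int64)] -> Bool
> chunksCoverFile totalSize cs =
>     not (null cs) &&
>     map fst cs == scanl (+) 0 (init (map snd cs)) &&
>     sum (map snd cs) == totalSize

Feeding it the pairs from chunkedLineBoundaries, along with the file size from getFileStatus, should always produce True.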
> withChunks :: Int -> (LB.ByteString -> a) -> FilePath -> IO [a]
> withChunks numThreads f path = do
>   offsets <- chunkedLineBoundaries numThreads path
>   ch <- newChan
>   forM_ offsets $ \(offset, count) -> forkIO $
>     handle (writeChan ch . Left) $
>       bracket (openFile path ReadMode) hClose $ \h -> do
>         hSeek h AbsoluteSeek (fromIntegral offset)
>         ret <- (f . LB.take count) `fmap` LB.hGetContents h
>         ret `seq` writeChan ch (Right ret)
>   forM offsets (const (readChan ch >>= either throwIO return))
With this process-a-file-in-chunks function in hand, we must restructure our original code a little to fit it in. Here’s the core scan-and-update-the-map loop, which does no I/O.
> reCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> reCountLines = foldl' count M.empty . LB.lines
>   where count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of
>                          ((_:g:_):_) -> M.insertWith' (+) g 1 m
>                          _ -> m
We’ll give it an alternate name so we can swap in a better implementation later.
> countLines = reCountLines
Because this function does no I/O, we can run it either sequentially or in parallel.
> sequential = fmap countLines . LB.readFile
> parallel = fmap (M.unionsWith (+)) . withChunks 2 countLines
The parallel function takes the maps returned by each thread and reduces them into a single map, giving a result of exactly the same type as the sequential function.
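To make the reduction step concrete, here’s a tiny, self-contained illustration (nothing in the program uses it): M.unionsWith (+) sums the counts for any key that appears in more than one per-thread map.

> -- Illustration: merging two per-thread maps adds up the counts for
> -- shared keys, so this example maps "index.html" to 5 and
> -- "about.html" to 1.
> mergeExample :: M.Map LB.ByteString Int
> mergeExample = M.unionsWith (+)
>     [ M.fromList [(LB.pack "index.html", 2), (LB.pack "about.html", 1)]
>     , M.fromList [(LB.pack "index.html", 3)]
>     ]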
> -- kind = sequential
> kind = parallel
By changing the definition of kind above, we can switch between the sequential and parallel versions of our code. Now main becomes just a framework:
> main = do
>   args <- getArgs
>   forM_ args $ \name -> kind name >>= \m ->
>     mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` g y
In order to benefit from the potential parallelism, we have to recompile to use GHC’s threaded runtime. This imposes about a 4% penalty in execution time, so the serial version of the code processes our 3.2 million records in 10.9 seconds instead of 10.5.
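In case it’s useful, the build and run steps look roughly like this (the program and log file names are placeholders, and -N2 tells the runtime to use two OS threads):

ghc -O2 -threaded --make WideFinder.lhs -o widefinder
./widefinder access.log +RTS -N2 -RTS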
Switching to the parallel code, it takes 7.7 seconds to process the same data. We get a less than perfect speedup in part because GHC’s garbage collector runs serially; that results in about 0.7 seconds of serial execution. Still, this is almost twice as fast as the Python code.
Next, let’s get rid of the gratuitous regular expressions, since they’re surely doing a lot more work than necessary for such a simple problem. Here’s a short handwritten replacement:
> fastCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> fastCountLines = foldl' count M.empty . LB.lines
>   where
>     count m line =
>       let quote = LB.drop (fromJust (LB.elemIndex '\"' line)) line
>       in if LB.pack "\"GET /en/" `LB.isPrefixOf` quote
>          then let pfx = LB.drop 9 quote
>                   uri = LB.take (fromJust (LB.elemIndex ' ' pfx)) pfx
>               in if LB.pack ".html" `LB.isSuffixOf` uri
>                  then M.insertWith' (+) uri 1 m
>                  else m
>          else m
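To measure it, point countLines at the new function in place of the definition given earlier (shown here as a comment, since countLines is already bound above):

> -- Swap this in for the earlier definition to use the handwritten
> -- scanner:
> -- countLines = fastCountLines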
Using fastCountLines as the value of countLines brings the best-case serial execution time (i.e. without the threaded runtime) down to 5.1 seconds, while parallel execution time drops to 3.5 seconds, about a third of the time required by the original serial-with-regexps Haskell code.
I would expect a four-core machine to improve performance further, though the speedup would fall off somewhat more because of GHC’s single-threaded garbage collector.
The withChunks function isn’t at all specialised to this task; we can use it to process any large text file in parallel.
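As an example of that, here’s a sketch (names of my own choosing, not part of the benchmark) that reuses the same machinery to count the lines of a big file across two threads; because the chunks are aligned to line boundaries, the per-chunk counts sum to the true total.

> -- Reuse withChunks for a different job: count lines per chunk in
> -- parallel, then add up the per-chunk totals.
> parallelLineCount :: FilePath -> IO Int
> parallelLineCount = fmap sum . withChunks 2 (length . LB.lines)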
A nit: isn’t “compare `on` snd” the same as “comparing snd” (using Data.Ord)? You can remove the definition of ‘on’ and save a line.
Not to nitpick, but 7.7 is hardly twice as fast as 11.6 (Python). Nice job though.
Really cool post.
Things like this make me happy that you’re part of the “Real World Haskell” thing, because this is exactly the kind of code that reminds me that Haskell is useful for day-to-day code-duggery.
Can you give some hints about how to get the two-minute version to compile?
Most of the imports are easily guessed except LB. LB should include lines, readFile, and ByteString, but these seem to be in Prelude, Prelude, and Data.ByteString.
Looking ahead, how is the polymorphism for =~ resolved?
Greg, here’s the list of imports.
import Control.Monad (forM_)
import qualified Data.ByteString.Lazy.Char8 as LB
import Data.List (foldl', sortBy)
import qualified Data.Map as M
import System.Environment (getArgs)
import Text.Regex.POSIX ((=~))
The result of (=~) is a list of lists, as the first pattern in the case expression indicates. Each element of the inner list is a lazy bytestring, representing a match subgroup.
I think these two statements are true:
1. ghc 6.6 knows about Text.Regex.Posix, but not about Text.Regex.POSIX.
2. Text.Regex.Posix doesn’t know about lazy bytestrings:
“No instance for (Text.Regex.Base.RegexLike.RegexLike Text.Regex.Posix.Wrap.Regex Data.ByteString.Base.LazyByteString)”
Or am I misunderstanding something? The example compiles with Data.ByteString.Char8.
Thanks again.
Oops, yes, typo there. And I’m using a newer version of the regexp libraries, which do work with lazy ByteStrings.
Bryan — many thanks for very instructive Haskell. Is there a complete version to download and benchmark with both GHC 6.6.1 and 6.8.1?
Nice speedup.
It would be interesting to see how Haskell compares to a similarly sped-up python version.