A few weeks ago, I spent a little time porting Peter Norvig’s Python code from his article How to Write a Spelling Corrector to Haskell. It’s a concise vehicle for a few Haskelly topics that don’t get much airing in front of a general audience: what idiomatic Haskell looks like, and the dreaded space leak.
Bear in mind as you read this that I’m porting Norvig’s algorithm, not implementing the approach that would be best suited to Haskell (dynamic programming). And I’m trying to do so idiomatically, so that the resulting code looks like decent Haskell. This isn’t an attempt to “golf” the code for either performance or terseness. Neither am I trying to show that either language is better than the other. Whew, enough with the disclaimers.
A Haskell module starts with a series of import statements that make
names from other modules available in our namespace. A typical
import
section is likely to explicitly list every name it imports
from a module, as here.
> import Control.Monad (liftM)
> import Data.Char (isAlpha, toLower)
> import Data.List (maximumBy)
There are some good reasons to list every name we import. It’s a helpful piece of self-documentation. By naming everything we import, we can later see which potentially unfamiliar module we’re importing a name from, without needing to search documentation or scratch our heads.
Explicit naming also reduces the likelihood of namespace collisions.
It’s extremely common for Haskell modules to export functions with
identical names but different types. If we import three modules that
each export the name foo
, but we only import the name foo
from
one of those modules, it’s clear to both the compiler and our reader
which one we must want to use.
Another good way to avoid namespace collisions while preserving readability is to import all names from a module en masse, but in such a way that we have to qualify those names when we refer to them. Here’s what I mean by this.
> import qualified Data.Map as M
Given the above qualified import, if we want to refer to the name
empty
as exported by Data.Map
, we’ll need to write M.empty
.
Now for our first mention of performance. Haskell’s built-in String
type is well known to be death to performance, as it’s no more than a
linked list of boxed Char
s. The effect this has on performance is
easily measurable on data sets as small as a few hundred kilobytes,
even on a fast CPU. The modern response is to use the newer, much
faster ByteString
types.
There are two kinds of ByteString
. The “lazy” variant follows the
Haskell norm of deferring work until it’s actually needed, while the
“strict” variant completes its work immediately. To give you an idea
of what this means, the lazy ByteString
version of the readFile
function reads a file in small chunks, on demand, while the strict
variant reads the entire file into memory in one go. Strict
ByteString
s can beat their lazy cousins on performance metrics, but
this is fairly rare. In this case, when I ran some time measurements,
they showed lazy beating strict by a factor of two.
> import qualified Data.ByteString.Lazy.Char8 as B
Let’s return to namespace clashes for a moment. The ByteString
modules each contain dozens of names that clash with names defined
in Prelude
, the module that is always implicitly imported into every
Haskell module. This is why we import them using the qualified
mechanism. (And by the way, switching from lazy to strict
ByteString
s throughout this module is a simple matter of deleting
the .Lazy
string from the imported module name above.)
In his Python spellchecker, Norvig uses a Python defaultdict
, in
other words a hash table, to hold his model data. Hash tables aren’t
suitable for use in pure functional code, so we use a Haskell Map
instead.
> type Model = M.Map B.ByteString Int
Norvig’s words
function has a similar form in Haskell.
> -- def words(text): return re.findall('[a-z]+', text.lower())
> mkWords :: B.ByteString -> [B.ByteString]
> mkWords = filter (not . B.null) . B.splitWith (not . isAlpha) . B.map toLower
Our mkWords
is a composed pipeline of functions, which we read from
right to left.
B.map toLower
returns a lower-cased copy of its input string.B.splitWith (not . isAlpha)
returns a list of strings, split everywhere thatisAlpha
returnsFalse
.filter (not . B.null)
eliminates any empty strings from the result of the call toB.splitWith
.
The mkWords
function is written in “point free” style, meaning that
although it takes an argument, the argument isn’t referred to in the
function definition. (To make matters confusing, the word “point” in
“point free” really means “value”, not any kind of punctuation
symbol.) Point free style can make one-line functions clearer, but it
makes longer definitions much harder to understand and modify.
Notice that my mkWords
function has a type signature. None of the
type signatures in this module is actually necessary; the compiler can
infer correct types for every value without ambiguity. But each one
makes the code more readable to me, which is a huge benefit. If I
were trying to golf this code, I’d have killed off all of the type
signatures, and along with it about half of the code’s grokkability. For example, I’d have to work out for myself that the point-free function definition above is in fact a function of only one argument, and what its type is.
Where there’s a clear similarity between the Python and Haskell
words
functions, the train
function has a radically different form
in Haskell.
> -- def train(features):
> -- model = collections.defaultdict(lambda: 1)
> -- for f in features:
> -- model[f] += 1
> -- return model
> train :: [B.ByteString] -> Model
> train = foldl' updateMap M.empty
> where updateMap model word = M.insertWith' (+) word 1 model
The observation here is that building the model is actually a “reduction” or “fold” operation: we start with an empty model, and fold over the input. Each fold/reduction step constructs a new model from the previous model and the next piece of input.
The Python code is a fold, too; there’s just not a clean way to write
a fold that augments a dict
in Python. So instead we have a loop
that somewhat obfuscates the fact that we’re
folding.
Folding isn’t an especially hard thing for neophytes to grasp, but
here lurk a few demons. It is in the definition of train
that Haskell’s laziness holds the power to bite us, in the form of
the dreaded space leak. There are two potential pitfalls here, one of which nearly bit me as I was writing this article.
Our first pitfall is that there are two ways we can fold over a list.
We can consume it from left to right, using foldl'
, or from right to
left, using foldr
. The choice of which order to use is not at all
arbitrary.
The strict left-to-right fold is best for cases when an intermediate result
is of no use to our caller. This is one of these cases: the caller
will use the final complete Model
; it can’t see or do anything with
incremental pieces of it.
Note that if we didn’t use the strict left-to-right fold foldl'
, but instead used the lazy variant (foldl
) we’d potentially get a space leak. Instead of constructing a new value
for the next step in the fold, each step would construct a lazily
deferred expression (called a thunk) that wouldn’t be evaluated
until the entire fold had completed. As a result, we’d construct and
return a huge thicket of thunks instead of a single tidy value.
A lazy right-to-left fold is frequently used in lazy functional programs,
as it’s suitable for a caller that will lazily consume a result.
However, if the caller needs a complete result (as in our case),
foldr
will generate a pile of thunks as it recurses down its input
list, and they won’t be consumed until foldr
reaches the end of the
list.
Understanding when to use foldr
or foldl'
isn’t actually difficult;
it’s just not something that newcomers to Haskell will naturally think
about. Worse, the compiler can hide some of the effects of the choice of fold under some circumstances. When I first wrote this article, I was using foldl'
, but there seemed to be no difference in performance or space usage compared to foldl
(lazy), so I switched to that. GHC’s strictness analysis was preventing me from noticing the problem with foldl
at high optimisation levels, because it turned the lazy fold into a strict one for me.
Our second potential problem is that the choice of fold direction
isn’t the only possible source of a space leak above. The expression
M.insertWith' (+) word 1 model
does one of two things when adding a
key to the Model
map.
- If no such key exists in the map, it adds a new entry, with a key of
word
and value of1
. - If the key is already present, it creates a new map with the entry’s value updated, incremented by one.
The potential problem here is that in the “update an existing key”
case, nothing is immediately going to inspect the new value. Thus a
Haskell implementation will construct a thunk to defer its evaluation.
Repeated updates of the same entry will thus construct a series of
thunks of the form 1 + 1 + 1 + ...
, instead of a single Int
, and
this pile of thunks won’t be evaluated until the first time someone
tries to use that value.
Here, we’ve avoided this possible space leak by using the
M.insertWith'
function, which is deliberately strict: it forces the
new value to be evaluated immediately. Haskell functions that are
stricter than the default usually have a tick ('
) character at the
ends of their names.
We now have two examples—in a simple-looking two-line function!—to demonstrate that avoiding space leaks in Haskell can be a subtle business for the uninitiated.
Space leaks are bad not merely for the memory they consume, but also because of their time effects. Build a big data structure in a leaky way, and the garbage collector will expend a lot of futile cycles traversing it.
And now, on we go to our next stop. The edits1
function in Python
generates a set
of all single-character edits of a word.
> -- def edits1(word):
> -- n = len(word)
> -- return set([word[0:i]+word[i+1:] for i in range(n)] +
> -- [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)]
> -- [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet]
> -- [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])
> edits1 :: B.ByteString -> [B.ByteString]
> edits1 word = concat
> [[h `B.append` B.tail t | (h, t) <- hts'],
> [B.concat [h, (B.take 1 . B.drop 1) t, B.take 1 t, B.drop 2 t] | (h, t) <- hts],
> [B.concat [h, B.singleton c, B.tail t] | (h, t) <- hts', c <- alpha],
> [B.concat [h, B.singleton c, t] | (h, t) <- hts, c <- alpha]]
> where hts = [B.splitAt n word | n <- [0..fromIntegral len]]
> hts' = take (len - 1) hts
> len = fromIntegral (B.length word)
> alpha = ['a'..'z']
If you find the above Haskell code less readable than its Python counterpart, this is not especially unusual. String manipulation code looks particularly clean in Python, at least to my eyes, while the corresponding Haskell is often more clunky.
The Haskell version of edits1
dispenses with sets, instead returning
a list of strings. This list can contain duplicate entries, while the
set
returned by the Python version cannot. It’s surprisingly
expensive to try to eliminate the duplicates in Haskell, but having
them present has no effect on correctness, as far as I can tell.
Our penultimate stop is the correct
function. The Python version
uses the short-circuiting or
operator to avoid the potentially
expensive work of searching for single-character edits if the word is
already present in the model, and it avoids the even more expensive
search for two-character edits if it finds a suitable single-character
edit.
> -- def known_edits2(word):
> -- return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
> -- def known(words): return set(w for w in words if w in NWORDS)
> -- def correct(word):
> -- candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
> -- return max(candidates, key=lambda w: NWORDS[w])
Our Haskell counterpart looks horribly inefficient by comparison. Oh noes! It constructs a complete list of all single- and two-character edits!
> correct :: Model -> B.ByteString -> B.ByteString
> correct model word = (maximumBy cmpFreq . head . filter (not . null))
> (map known [[word], edits1 word, edits2 word] ++ [[word]])
> where known = filter (`M.member` model)
> edits2 = concatMap edits1 . edits1
> findFreq word = M.findWithDefault 1 word model
> cmpFreq a b = compare (findFreq a) (findFreq b)
However, lazy evaluation works to our advantage here. What actually
happens is that the call to head
causes only the first non-null
candidate (from filter
) to be consumed. The construction of the
rest of the list is deferred as a thunk that will never subsequently
be evaluated. We thus get the short-circuiting behaviour of Python’s
or
operator for free: our expression could compute all of
these expensive matches, but it doesn’t actually do so unless we
really need them.
Finally, we separate the reading of data and interaction from our
referentially transparent code above with a little bit of “lifting”.
Our model construction and correction functions are completely pure;
they have no side effects. But we need to read data from a file,
which is an effectful operation. The liftM
function takes the pure
code on the left, and hooks it up to the impure action on the right,
so that the result of the pure code is available within the impure
IO
monad.
> readModel :: FilePath -> IO Model
> readModel name = (train . mkWords) `liftM` B.readFile name
> main :: IO ()
(Hark back to the discussion of lazy ByteString
s earlier. The
B.readFile
function reads the entire model in small chunks, so the
memory usage of this function is proportional to the size of the model
we construct, with a constant factor for I/O that’s independent of the
size of the input file.)
> main = do
> model <- readModel "big.txt"
> interact (unlines . map (B.unpack . correct model . B.pack) . lines)
The definition of main
tersely says that we should construct our
model, then use it to correct words fed to us on stdin
, one per
line. The argument to interact
is once again a pure function;
interact
handles all the dirty work of reading input and writing
output.
That’s us about done. Separation between pure and impure code; space leaks partly demystified; and no cheerleading.
Why not use foldl’ instead of foldl? As I understand, foldl will just accumulate thunks. I thought in order to be completely strict you need both foldl’ as well as a strict function such as insertWith’?
Wow !
I’m not sure whether this was intended to be used as a tutorial, but it really is the best example of Haskell I have ever read !
I wish you had written it a year or two earlier…
I am to show your article to many people I know who have been frightened with the usual Haskell articles…
P!
Wow!
I worked on this problem a month or two ago when Norvig’s spellchecker hit Reddit, but my Haskell implementation was ugly, inefficient and probably incorrect. Your elucidation clarifies all of the problems I ran into and results in beautiful, concise code. This is a service to the world and I thank you!
For more discussion on the subject, see this thread on the cafe mailing list:
http://thread.gmane.org/gmane.comp.lang.haskell.cafe/21780
petekaz is correct. As illustrated on http://www.haskell.org/hawiki/StackOverflow foldl will accumulate thunks and lead to a stack overflow in just about all cases. foldl is pretty much useless. foldl’ should be what is used if you have a strict function and foldr otherwise.
I’ve fixed the article. Oops! In my defence, I had substituted
foldl
forfoldl'
and found no difference in performance or memory use, and was even looking at profiler output and seeing a nice slow heap growth instead of an explosion. I conclude that the compiler was being cleverer than I was.It does that. With optimisation, it seems to me foldl never causes trouble. Why shouldn’t the compiler be better at strictness analysis than me?
The Python code is a fold, too; there’s just not a clean way to write a fold that augments a dict in Python. So instead we have a loop that somewhat obfuscates the fact that we’re folding.
Huh? The reduce() function has been part of Python for as long as I’ve been aware of the language, and the following works fine for me:
def updateWith(f, default):
def curried(dict,index):
if index in dict:
dict[index] = f(dict[index])
else: dict[index] = default
return dict
return curried
print reduce(updateWith(lambda x: x + 1, 1), [1,2,3,4,5,1,3,7,0], {} This prints
{0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 1, 7: 1}
as expected.It would probably be considered poor style, though, and I believe there’s an ongoing argument in the Python community as to whether reduce() and map() should be removed from the language entirely: I guess they don’t buy your argument that code using map and fold is clearer than the equivalent code using loops
Other than that, great article!
Miles, I know about reduce, but as I pointed out and you confirm, the code you end up writing to use it to update a dict is messy.
Ah, OK. Though I think it’s only messy because you have to define updateWith() yourself (and because defaultdicts aren’t available on the Python interpreter I have to hand). If you want to do this kind of thing with folds rather than loops, then write updateWith once, stick it in a module somewhere, and import it when you need it
“… not implementing the approach that would be best suited to Haskell (dynamic programming).”
Could you be convinced to do that? Would *love* to see that explained as nicely as you have explained this implementation.
I really enjoyed your article.