An improved spelling corrector in Haskell

November 01, 2009 at 11:09 AM | categories: Uncategorized | View Comments

My previous post came up on the Haskell Reddit yesterday, and I got some great suggestions for improvements to the program, which I've checked into GitHub.

  • Map.fromListWith replaces List.foldl' and Map.insertWith'. fromListWith appears to be specifically designed for aggregating values that share a common key (in .NET terminology it's the GroupBy function).

  • I'll commonly use a construct like readFile "big.txt" >>= return . train . lowerWords, to bind a monad value readFile "big.txt" to a function (train . lowerWords), then produce a new monad value with return. A shorter way to write this is fmap (train . lowerWords) (readFile "big.txt"), or, with an import from Control.Applicative, (train . lowerWords <$> readFile "big.txt").

  • You can replace the lambda syntax (\w -> Map.lookup w nwords) with an operator section, ('Map.lookup' nwords). You might see this more commonly as (+1) in place of (\x -> x + 1); it's the same thing. Edit: the challenging Markdown syntax means that you'll have to imagine the single quotes around Map.lookup replaced with backticks, `.

I should have realised this at the time, but it didn't occur to me that you can use full pattern-matching syntax in the right-hand side of a Haskell list comprehension. If the match fails on some element then that element is filtered out. We can use this to improve on the Python version of the edits1 function: whereas the Python version combines its b[0] subscripts with an if b check, the Haskell version can use pattern matching to do both. We can also use the inits and tails functions to build substrings instead of brute-force applications of take and drop.

# Python
def edits1(word):
   s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in s if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
   inserts    = [a + c + b     for a, b in s for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

> -- Haskell
> edits1 word = let s = zip (inits word) (tails word)
>                   deletes    = [ a ++ y     | (a, _:y  ) <- s ]
>                   transposes = [ a ++ y:x:z | (a, x:y:z) <- s ]
>                   replaces   = [ a ++ c:y   | (a, _:y  ) <- s, c <- alphabet ]
>                   inserts    = [ a ++ c:x   | (a, x    ) <- s, c <- alphabet ]
>                in Set.fromList $ concat [ deletes, transposes, replaces, inserts ]

The edits1 function is the densest one in both the Python and Haskell versions. It's the one where the intent of the code was least clear to me when I originally saw it, probably because of the extra clutter in the list comprehensions. In the latest Haskell revision it's more obvious what's going on.

I've consistently used the names x, y and z for the first three parts of the word's suffix. Because they consistently stand for the same thing, any inconsistencies jump out at you when you read the code:

  • a ++ y stands out because x is missing; it's been deleted. (x isn't even mentioned in the pattern match.)
  • a ++ y:x:z stands out because x and y are the wrong way round; we're transposing them
  • a ++ c:y stands out because we've got c instead of x; x has been replaced. (Again, x has dropped out of the pattern match.)
  • a ++ c:x has c in front of x; it's been inserted

After I'd written the first Haskell version I was skeptical. The Python version uses some typical Python idioms, like if b for checking for empty strings, using the same syntax for accessing lists and sets, and overloading the or operator to look for the first non-empty set, and Haskell doesn't have these same shortcuts. However the latest set of revisions makes the the core algorithm in the Haskell version more readable than the Python equivalent.

Read and Post Comments