An improved spelling corrector in HaskellNovember 01, 2009 at 11:09 AM | categories: Uncategorized | View Comments
My previous post came up on the Haskell Reddit yesterday, and I got some great suggestions for improvements to the program, which I've checked into GitHub.
fromListWithappears to be specifically designed for aggregating values that share a common key (in .NET terminology it's the
I'll commonly use a construct like
readFile "big.txt" >>= return . train . lowerWords, to bind a monad value
readFile "big.txt"to a function
(train . lowerWords), then produce a new monad value with
return. A shorter way to write this is
fmap (train . lowerWords) (readFile "big.txt"), or, with an import from Control.Applicative,
(train . lowerWords <$> readFile "big.txt").
You can replace the lambda syntax
(\w -> Map.lookup w nwords)with an operator section,
('Map.lookup' nwords). You might see this more commonly as
(+1)in place of
(\x -> x + 1); it's the same thing. Edit: the challenging Markdown syntax means that you'll have to imagine the single quotes around
Map.lookupreplaced with backticks, `.
I should have realised this at the time, but it didn't occur to me that you can use full pattern-matching syntax in the right-hand side of a Haskell list comprehension. If the match fails on some element then that element is filtered out. We can use this to improve on the Python version of the
edits1 function: whereas the Python version combines its
b subscripts with an
if b check, the Haskell version can use pattern matching to do both. We can also use the
tails functions to build substrings instead of brute-force applications of
# Python def edits1(word): s = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[1:] for a, b in s if b] transposes = [a + b + b + b[2:] for a, b in s if len(b)>1] replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] inserts = [a + c + b for a, b in s for c in alphabet] return set(deletes + transposes + replaces + inserts) > -- Haskell > edits1 word = let s = zip (inits word) (tails word) > deletes = [ a ++ y | (a, _:y ) <- s ] > transposes = [ a ++ y:x:z | (a, x:y:z) <- s ] > replaces = [ a ++ c:y | (a, _:y ) <- s, c <- alphabet ] > inserts = [ a ++ c:x | (a, x ) <- s, c <- alphabet ] > in Set.fromList $ concat [ deletes, transposes, replaces, inserts ]
edits1 function is the densest one in both the Python and Haskell versions. It's the one where the intent of the code was least clear to me when I originally saw it, probably because of the extra clutter in the list comprehensions. In the latest Haskell revision it's more obvious what's going on.
I've consistently used the names
z for the first three parts of the word's suffix. Because they consistently stand for the same thing, any inconsistencies jump out at you when you read the code:
a ++ ystands out because
xis missing; it's been deleted. (
xisn't even mentioned in the pattern match.)
a ++ y:x:zstands out because
yare the wrong way round; we're transposing them
a ++ c:ystands out because we've got
xhas been replaced. (Again,
xhas dropped out of the pattern match.)
a ++ c:xhas
cin front of
x; it's been inserted
After I'd written the first Haskell version I was skeptical. The Python version uses some typical Python idioms, like
if b for checking for empty strings, using the same syntax for accessing lists and sets, and overloading the
or operator to look for the first non-empty set, and Haskell doesn't have these same shortcuts. However the latest set of revisions makes the the core algorithm in the Haskell version more readable than the Python equivalent.