Package takeover: indents
Published on December 22, 2016 under the tag haskell
Parsers are one of Haskell’s indisputable strengths. The most well-known library is probably Parsec. This parser combinator library has been around since at least 2001, but is still widely used today, and it has inspired new generations of general purpose parsing libraries.
Parsec makes it really easy to prototype parsers for certain classes of grammars. Lots of grammars in use today, however, are whitespace-sensitive. There are different approaches for dealing with that. One of the most commonly used approaches is to add explicit INDENT and DEDENT tokens. But that usually requires you to add a separate lexing phase – not a bad idea by itself, but a bit annoying if you are just writing a quick prototype.
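To make the INDENT/DEDENT approach concrete, here is a minimal sketch of such a lexing phase. It only uses base; the Token type and the tokenize function are hypothetical names made up for this illustration, and it assumes indentation is written with spaces only.

```haskell
import Data.Char (isSpace)

-- Hypothetical token type for this sketch; not part of Parsec or indents.
data Token = Indent | Dedent | Word String
    deriving (Eq, Show)

-- Turn leading whitespace into explicit Indent/Dedent tokens, keeping a
-- stack of the indentation widths that are currently open.  Blank lines
-- are dropped before lexing.
tokenize :: String -> [Token]
tokenize = go [0] . filter (not . all isSpace) . lines
  where
    -- End of input: close every level that is still open.
    go stack [] = replicate (length stack - 1) Dedent
    go stack (l : ls) =
        let (ws, rest)       = span (== ' ') l
            (closed, stack') = adjust (length ws) stack
        in closed ++ map Word (words rest) ++ go stack' ls

    -- Deeper than the top of the stack: open a level.  Shallower: pop
    -- levels (one Dedent each) until the width fits.  A width that matches
    -- no open level is accepted leniently and reopened as a new level; a
    -- stricter lexer would report an error there.
    adjust width stack@(top : below)
        | width > top = ([Indent], width : stack)
        | width < top = let (ds, s) = adjust width below in (Dedent : ds, s)
        | otherwise   = ([], stack)
    adjust _ [] = ([], [0])
```

A parser running on this token stream no longer has to look at columns at all, which is the appeal of the approach; the price is the extra pass shown here.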
That is why I like the indents package – it sits in a sweet spot because it is a straightforward package that allows you to turn any Parsec parser into an indentation-based one without having to change too many types.
It offers a bunch of semi-cryptic operators like <+/> and <*/> which I would personally avoid in favor of their named variants, but other than that I would consider it a fairly “easy” package.
Unfortunately, I found a few bugs and inconveniences in the old package. One interesting bug would allow failing branches of the parse to still affect the internal indentation state, which is very bad [1]. Additionally, the package hard-coded the underlying monad, which prevented you from using it together with monad transformers.
Because I didn’t want to confuse people by creating yet another package, I took over the existing one, which is a very smooth process nowadays. I can definitely recommend this to anyone who discovers issues like these in unmaintained packages. The Hackage trustees are doing great and valuable work there.
I have now uploaded a new version which fixes these issues. To celebrate that, let’s create a toy parser for indentation-sensitive taxonomies such as the big tea taxonomy [2]:
tea
  green
    korean
      pucho-cha
      chung-cha
    vietnamese
      snow-green-tea
    japanese
      roasted
      ...
  black
    georgian
      traditional
      caravan-blend
    african
      kenyan
      tanzanian
    ...
We need some imports to get rolling. After all, this blogpost is a literate Haskell file which can be loaded in GHCi.
import Control.Applicative ((*>), (<*), (<|>))
import qualified Text.Parsec as Parsec
import qualified Text.Parsec.Indent as Indent
We just store a single term in the category as a String.
type Term = String
A taxonomy is then recursively defined as a Term and its child taxonomies.
data Taxonomy = Taxonomy Term [Taxonomy] deriving (Eq, Show)
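To get a feel for this recursive type, the following sketch builds a small value by hand and prints it back out as indented text. The example value and the render helper are hypothetical additions for illustration; the two type definitions are repeated only so that the block can be loaded on its own.

```haskell
type Term = String

data Taxonomy = Taxonomy Term [Taxonomy] deriving (Eq, Show)

-- A small hand-written value: "tea" with two children, one of which has
-- a child of its own.
example :: Taxonomy
example =
    Taxonomy "tea"
        [ Taxonomy "green" [Taxonomy "korean" []]
        , Taxonomy "black" []
        ]

-- Hypothetical helper (not part of the parser): print a taxonomy as
-- text, indenting each level two spaces deeper than its parent.
render :: Taxonomy -> String
render = unlines . go 0
  where
    go depth (Taxonomy term subs) =
        (replicate (2 * depth) ' ' ++ term) : concatMap (go (depth + 1)) subs
```

Calling putStr (render example) prints one term per line, which is the same layout the parser is meant to read.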
A parser for a term is easy. We just parse an identifier and then skip the spaces following that.
pTerm :: Indent.IndentParser String () String
pTerm =
    Parsec.many1 allowedChar <* Parsec.spaces
  where
    allowedChar = Parsec.alphaNum <|> Parsec.oneOf ".-"
In the parser for a Taxonomy, we use the indents library. withPos is used to “remember” the indentation position. After doing that, we can use combinators such as indented to check if we are indented past that point.
pTaxonomy :: Indent.IndentParser String () Taxonomy
pTaxonomy = Indent.withPos $ do
    term <- pTerm
    subs <- Parsec.many $ Indent.indented *> pTaxonomy
    return $ Taxonomy term subs
Now we need a simple function to put it all together:
readTaxonomy :: FilePath -> IO Taxonomy
readTaxonomy filePath = do
    txt <- readFile filePath
    let errOrTax = Indent.runIndentParser parser () filePath txt
    case errOrTax of
        Left err  -> fail (show err)
        Right tax -> return tax
  where
    parser = pTaxonomy <* Parsec.eof
And we can verify that this works in GHCi:
*Main> readTaxonomy "taxonomy.txt"
Taxonomy "tea" [Taxonomy "green" [Taxonomy "korean" [...
*Main>
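For quick experiments it can be handy to skip the file and parse a String directly. The sketch below is one way to do that, assuming the parsec package and a version of indents that exports runIndentParser, as used in this post. parseTaxonomy is a hypothetical helper name, and the types and parsers are repeated from this post so that the block is self-contained.

```haskell
import Control.Applicative ((<|>))
import qualified Text.Parsec as Parsec
import qualified Text.Parsec.Indent as Indent

type Term = String

data Taxonomy = Taxonomy Term [Taxonomy] deriving (Eq, Show)

-- Same parsers as in the post.
pTerm :: Indent.IndentParser String () String
pTerm = Parsec.many1 allowedChar <* Parsec.spaces
  where
    allowedChar = Parsec.alphaNum <|> Parsec.oneOf ".-"

pTaxonomy :: Indent.IndentParser String () Taxonomy
pTaxonomy = Indent.withPos $ do
    term <- pTerm
    subs <- Parsec.many $ Indent.indented *> pTaxonomy
    return $ Taxonomy term subs

-- Hypothetical helper: run the parser on an in-memory String.  The
-- "<inline>" source name only shows up in error messages.
parseTaxonomy :: String -> Either Parsec.ParseError Taxonomy
parseTaxonomy = Indent.runIndentParser (pTaxonomy <* Parsec.eof) () "<inline>"
```

With this in scope, parseTaxonomy "tea\n  green\n  black\n" should give a Right value with "green" and "black" as children of "tea", while input containing characters outside the allowed set ends up in the Left branch.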
[1] Special thanks to Sam Anklesaria for writing the original package.
[2] The interesting tea taxonomy can be found in this blogpost: https://jameskennedymonash.wordpress.com/mind-maps/amazing-tea-taxonomy/.