Note: This post is literate Haskell, you can copy the whole text to a .lhs file and compile it with GHC.
A friend of mine was showing me a very simple and elegant parser for
JSON, written in Scala. As an excercise, I decided
to write one in Haskell, using the Parsec
parser combinator library. Here’s a quick walkthrough.
First we need some imports,
> import qualified Data.Map as Map
> import Control.Monad (liftM)
> import Text.ParserCombinators.Parsec
> import qualified Text.ParserCombinators.Parsec.Token as T
> import Text.ParserCombinators.Parsec.Language
Now, to represent a JSON value, we use an algebraic data type. For simplicity, we’ll only worry about
integers and not floating point numbers. The definition of the type is simple, to be concise we use single
letter constructor names (not a good idea if you want to publish a proper library).
> data JSON = S String
> | I Integer
> | B Bool
> | L [JSON]
> | O (Map.Map String JSON)
> deriving Show
Now, to make our life somewhat easier, we’ll make use of the facilities in the ParsecToken module. This module
provides us with ways to create functions for generic lexers, which deal correctly with whitespace (and comments)
in a language. The first step is to create a language definition, in our case it only specifies how JSON comments
are delimited. From the definition we create a token parser which we call lexer.
> ldef = emptyDef { commentStart = "/*"
> , commentEnd = "*/"
> , commentLine = "//"
> }
> lexer = T.makeTokenParser ldef
We can now pass lexer to the convenience functions of ParsecToken.
As is common with such parsers, we rebind their names in our module to spare us referring to lexer
every time we use them.
> symbol = T.symbol lexer
> stringLiteral = T.stringLiteral lexer
> integer = T.integer lexer
> squares = T.squares lexer
> braces = T.braces lexer
> commaSep = T.commaSep lexer
And now the parser itself. The basic element is a JSON value. From now on I’ll include the type
signatures to make things easier to follow, but they are of course not necessary.
> value :: Parser JSON
> value = liftM S stringLiteral
> <|> liftM I integer
> <|> liftM B boolean
> <|> liftM L list
> <|> liftM O object
A value is simply a choice between each of the basic types, a string, an integer, a boolean, a list of values
or an object (which is simply a mapping from strings to values). We defer each to its own parser, and simply
lift the value constructors to the parser monad.
We get parsing of strings and integers completely for free from the ParsecToken module, but the other three we
need to handle ourselves. The boolean case is simple,
> boolean :: Parser Bool
> boolean = do symbol "true"
> return True
> <|> do symbol "false"
> return False
Surprisingly, the list is even simpler, due to the helpers from ParsecToken.
> list :: Parser [JSON]
> list = squares $ commaSep value
Parsing objects is also fairly simple, basically just like list except that instead of parsing a
comma seperated list of values, we parse a comma seperated list of key/value pairs. Such a pair is handled
by the nested keyValue parser.
> object :: Parser (Map.Map String JSON)
> object = liftM Map.fromList $ braces $ commaSep keyValue
> where
> keyValue = do key <- stringLiteral
> symbol ":"
> val <- value
> return (key, val)
And that’s it! To test this easily, we finally add a main action that simply parses a JSON value from
standard input.
> main = getContents >>= parseTest value
Testing the parser
A simple test, including a comment in the middle:
arnarb-macbook:tmp arnarb$ runhaskell json.hs
[1,"two",false,{"one":1, /* a comment here */ "two":2, "list":[]}]
< here I pressed ctrl-d to insert an EOF, rest is output >
L [I 1,S "two",B False,O (fromList [("list",L []),(”one”,I 1),(”two”,I 2)])]
As an more complex test, we run it on the first example from json.org (newlines
and indent added to the output for clarity).
arnarb-macbook:tmp arnarb$ runhaskell json.hs
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
“GlossSee”: “markup”
}
}
}
}
}
< here I pressed ctrl-d to insert an EOF, rest is output >
O (fromList [
("glossary",O (fromList [
("GlossDiv",O (fromList [
("GlossList",O (fromList [
("GlossEntry",O (fromList [
("Abbrev",S "ISO 8879:1986"),
("Acronym",S "SGML"),
("GlossDef",O (fromList [
("GlossSeeAlso",L [S "GML",S "XML"]),
(”para”,S “A meta-markup language, used to create markup languages such as DocBook.”)
])),
(”GlossSee”,S “markup”),
(”GlossTerm”,S “Standard Generalized Markup Language”),
(”ID”,S “SGML”),
(”SortAs”,S “SGML”)
]))
])),
(”title”,S “S”)
])),
(”title”,S “example glossary”)
]))
])
Finally, we should note that many people
have
done
this
before :)
Edits:
- Added link to Scala version.
- Small revisisons, thanks to
kpreid on #haskell for pointing out braces and squares.
- Stop lugging
lexer around, thanks to Paczesiowa on reddit for useful suggestions.