Posts tagged ‘JSON’

Parsing JSON with Scala

In the previous post, Arnar wrote about how to do a JSON parser in Haskell using parsec. He also mentioned that I had shown him a similar parser written in Scala, but I must admit that I stole it from the upcoming book Programming in Scala. The book is very good and I recommend it to anyone that is interested in learning Scala.

Before I get on with the post, I’m gonna spend a few words on explaining Scala. It is a strictly typed language with a very powerful type inference system. Scala runs on the Java Virtual Machine, a port or the .NET CLR is available. It is a mix between functional and object oriented languages and is almost fully interoperable with Java. Scala replaces some of the constructs and mechanisms that you normally have in Java with more general things. Rather than using anonymous classes you have closures, there are no interfaces, but you have traits and so on. Scala also adds implicit conversions, operator overloading, mutable and immutable types and some more goodies.

Scala access to all of the class libraries that come with Java and everything you throw into its classpath. It also comes with its own class library which has collections, tuples, parsers, xml types, etc etc. Relevant for this post, maps are written as Map(key -> value), tuples as (first, second, third) and lists as List(1, 2, 3).

The first thing that is going to look weird is that the type of things appears after the name or definition and there is a keyword in front.

val i = 3
val j: Int = i
 
def hello: String = "Hello world"

The val keyword placed in front of the variables is to tell the compiler that an immutable value is coming up and var is the mutable counterpart. Methods are defined with the def keyword. The first two lines show the type inference system at work, since you have the number 3 it can only be an integer. If there is any doubt about the type during compilation, the compiler will complain and exit. The same goes for the return type of the “hello” method, the type String is specified but can be omitted. Since the method is a oneliner the = "Hello world" is a short hand for { return "Hello world" }.

Back to the JSON parser, the parsing combinator package has to be imported

import scala.util.parsing.combinator._

In Scala _ is a sequential any reference, or like in this case as * in Java imports. The parser itself is a token based parser, and extends from a trait called JavaTokenParsers which includes parsing for identifiers, numbers and strings.

class JSON extends JavaTokenParsers { 
        def obj: Parser[Map[String, Any]] = 
                "{"~> repsep(member, ",") <~"}" ^^ (Map() ++ _) 
 
        def arr: Parser[List[Any]] = 
                "["~> repsep(value, ",") <~"]" 
 
        def member: Parser[(String, Any)] = 
                stringLiteral~":"~value ^^ 
                        { case name~":"~value => (name, value) } 
 
        def value: Parser[Any] = ( 
                obj 
                | arr 
                | stringLiteral 
                | floatingPointNumber ^^ (_.toInt) 
                | "null" ^^ (x => null) 
                | "true" ^^ (x => true) 
                | "false" ^^ (x => false) 
                ) 
}

The root of the parse tree is defined in the method value which returns a Parser[Any]. Since the short hand notation is being used and it expects are Parser[Any], it uses implicit conversions for the Parser class and from there you get the ( ... | ... ) notation which provides you with parser alternatives. The weird ^^ thing is result conversions, so floatingPointNumber ^^ (_.toInt) means, take that floating number and return it as an integer using the wildcard _. repsep is a shorthand for interleaved repitition. Lastly the wavy dudes with ~, <~ and ~> denotes sequential composition and with the less and more than symbols it keeps to the left or right.

To run the parser and do something meaningful with it we read in a file pass it on to the parser and print the output

import java.io.FileReader 
 
object JSONTest extends JSON { 
        def main(args: Array[String]) { 
                val reader = new FileReader(args(0)) 
                println(parseAll(value, reader)) 
        } 
}

Different from Java, Scala does not have static types per se but allows you to do static like things by using a object instead of a class. Also notice how the FileReader is being imported directly from the Java class libraries.

After we’ve compiled the code and run it on the same file JSON file as Arnar provided in his example, we are presented with

[22.2] parsed: 
    Map(”glossary” -> 
        Map(”title” -> “example glossary”, “GlossDiv” -> 
            Map(”title” -> “S”, 
                “GlossList” -> 
                Map(”GlossEntry” -> 
                    Map(”Acronym” -> “SGML”, 
                        “Abbrev” -> “ISO 8879:1986″, 
                        “GlossSee” -> “markup”, 
                        “GlossTerm” -> “Standard Generalized Markup Language”, 
                        “ID” -> “SGML”, 
                        “GlossDef” -> 
                            Map(”para” -> 
                                “A meta-markup language, used to create markup 
                                languages such as DocBook.”, 
                                “GlossSeeAlso” -> List(”GML”, “XML”)), 
                        “SortAs” -> “SGML”)))))

I am to lazy to format this output nicely, but you get the picture. :) I noticed also that this JSON parser does not handle comments, which I leave as an exercise to my good readers.

I have been meaning to write something on Scala for quite some time now and now I was forced to since Arnar mentioned my crazy ranting. In the coming weeks I am hopefully going to show you some more niceness of Scala, such as implicit conversions, XML types, its functional nature, etc.

Parsing JSON with Haskell

Note: This post is literate Haskell, you can copy the whole text to a .lhs file and compile it with GHC.

A friend of mine was showing me a very simple and elegant parser for JSON, written in Scala. As an excercise, I decided to write one in Haskell, using the Parsec parser combinator library. Here’s a quick walkthrough.

First we need some imports,

> import qualified Data.Map as Map
> import Control.Monad (liftM)
> import Text.ParserCombinators.Parsec
> import qualified Text.ParserCombinators.Parsec.Token as T
> import Text.ParserCombinators.Parsec.Language

Now, to represent a JSON value, we use an algebraic data type. For simplicity, we’ll only worry about integers and not floating point numbers. The definition of the type is simple, to be concise we use single letter constructor names (not a good idea if you want to publish a proper library).

> data JSON = S String 
>           | I Integer 
>           | B Bool 
>           | L [JSON] 
>           | O (Map.Map String JSON)
>           deriving Show

Now, to make our life somewhat easier, we’ll make use of the facilities in the ParsecToken module. This module provides us with ways to create functions for generic lexers, which deal correctly with whitespace (and comments) in a language. The first step is to create a language definition, in our case it only specifies how JSON comments are delimited. From the definition we create a token parser which we call lexer.

> ldef = emptyDef { commentStart = "/*"
>                 , commentEnd = "*/"
>                 , commentLine = "//"
>                 }
> lexer = T.makeTokenParser ldef

We can now pass lexer to the convenience functions of ParsecToken. As is common with such parsers, we rebind their names in our module to spare us referring to lexer every time we use them.

> symbol        = T.symbol        lexer
> stringLiteral = T.stringLiteral lexer
> integer       = T.integer       lexer
> squares       = T.squares       lexer
> braces        = T.braces        lexer
> commaSep      = T.commaSep      lexer

And now the parser itself. The basic element is a JSON value. From now on I’ll include the type signatures to make things easier to follow, but they are of course not necessary.

> value :: Parser JSON
> value = liftM S stringLiteral
>     <|> liftM I integer
>     <|> liftM B boolean
>     <|> liftM L list
>     <|> liftM O object

A value is simply a choice between each of the basic types, a string, an integer, a boolean, a list of values or an object (which is simply a mapping from strings to values). We defer each to its own parser, and simply lift the value constructors to the parser monad. We get parsing of strings and integers completely for free from the ParsecToken module, but the other three we need to handle ourselves. The boolean case is simple,

> boolean :: Parser Bool
> boolean = do symbol "true"
>              return True
>       <|> do symbol "false"
>              return False

Surprisingly, the list is even simpler, due to the helpers from ParsecToken.

> list :: Parser [JSON]
> list = squares $ commaSep value

Parsing objects is also fairly simple, basically just like list except that instead of parsing a comma seperated list of values, we parse a comma seperated list of key/value pairs. Such a pair is handled by the nested keyValue parser.

> object :: Parser (Map.Map String JSON)
> object = liftM Map.fromList $ braces $ commaSep keyValue
>          where
>              keyValue = do key <- stringLiteral
>                            symbol ":"
>                            val <- value
>                            return (key, val)

And that’s it! To test this easily, we finally add a main action that simply parses a JSON value from standard input.

> main = getContents >>= parseTest value

Testing the parser

A simple test, including a comment in the middle:

arnarb-macbook:tmp arnarb$ runhaskell json.hs 
[1,"two",false,{"one":1, /* a comment here */ "two":2, "list":[]}]
< here I pressed ctrl-d to insert an EOF, rest is output >
L [I 1,S "two",B False,O (fromList [("list",L []),(”one”,I 1),(”two”,I 2)])]

As an more complex test, we run it on the first example from json.org (newlines and indent added to the output for clarity).

arnarb-macbook:tmp arnarb$ runhaskell json.hs 
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          “GlossSee”: “markup”
        }
      }
    }
  }
}
< here I pressed ctrl-d to insert an EOF, rest is output >
O (fromList [
    ("glossary",O (fromList [
        ("GlossDiv",O (fromList [
            ("GlossList",O (fromList [
                ("GlossEntry",O (fromList [
                    ("Abbrev",S "ISO 8879:1986"),
                    ("Acronym",S "SGML"),
                    ("GlossDef",O (fromList [
                        ("GlossSeeAlso",L [S "GML",S "XML"]),
                        (”para”,S “A meta-markup language, used to create markup languages such as DocBook.”)
                    ])),
                    (”GlossSee”,S “markup”),
                    (”GlossTerm”,S “Standard Generalized Markup Language”),
                    (”ID”,S “SGML”),
                    (”SortAs”,S “SGML”)
                ]))
            ])),
            (”title”,S “S”)
        ])),
        (”title”,S “example glossary”)
    ]))
])

Finally, we should note that many people have done this before :)

Edits:

  • Added link to Scala version.
  • Small revisisons, thanks to kpreid on #haskell for pointing out braces and squares.
  • Stop lugging lexer around, thanks to Paczesiowa on reddit for useful suggestions.