Author Archive

Parsing JSON with Scala

In the previous post, Arnar wrote about how to do a JSON parser in Haskell using parsec. He also mentioned that I had shown him a similar parser written in Scala, but I must admit that I stole it from the upcoming book Programming in Scala. The book is very good and I recommend it to anyone that is interested in learning Scala.

Before I get on with the post, I’m gonna spend a few words on explaining Scala. It is a strictly typed language with a very powerful type inference system. Scala runs on the Java Virtual Machine, a port or the .NET CLR is available. It is a mix between functional and object oriented languages and is almost fully interoperable with Java. Scala replaces some of the constructs and mechanisms that you normally have in Java with more general things. Rather than using anonymous classes you have closures, there are no interfaces, but you have traits and so on. Scala also adds implicit conversions, operator overloading, mutable and immutable types and some more goodies.

Scala access to all of the class libraries that come with Java and everything you throw into its classpath. It also comes with its own class library which has collections, tuples, parsers, xml types, etc etc. Relevant for this post, maps are written as Map(key -> value), tuples as (first, second, third) and lists as List(1, 2, 3).

The first thing that is going to look weird is that the type of things appears after the name or definition and there is a keyword in front.

val i = 3
val j: Int = i
 
def hello: String = "Hello world"

The val keyword placed in front of the variables is to tell the compiler that an immutable value is coming up and var is the mutable counterpart. Methods are defined with the def keyword. The first two lines show the type inference system at work, since you have the number 3 it can only be an integer. If there is any doubt about the type during compilation, the compiler will complain and exit. The same goes for the return type of the “hello” method, the type String is specified but can be omitted. Since the method is a oneliner the = "Hello world" is a short hand for { return "Hello world" }.

Back to the JSON parser, the parsing combinator package has to be imported

import scala.util.parsing.combinator._

In Scala _ is a sequential any reference, or like in this case as * in Java imports. The parser itself is a token based parser, and extends from a trait called JavaTokenParsers which includes parsing for identifiers, numbers and strings.

class JSON extends JavaTokenParsers { 
        def obj: Parser[Map[String, Any]] = 
                "{"~> repsep(member, ",") <~"}" ^^ (Map() ++ _) 
 
        def arr: Parser[List[Any]] = 
                "["~> repsep(value, ",") <~"]" 
 
        def member: Parser[(String, Any)] = 
                stringLiteral~":"~value ^^ 
                        { case name~":"~value => (name, value) } 
 
        def value: Parser[Any] = ( 
                obj 
                | arr 
                | stringLiteral 
                | floatingPointNumber ^^ (_.toInt) 
                | "null" ^^ (x => null) 
                | "true" ^^ (x => true) 
                | "false" ^^ (x => false) 
                ) 
}

The root of the parse tree is defined in the method value which returns a Parser[Any]. Since the short hand notation is being used and it expects are Parser[Any], it uses implicit conversions for the Parser class and from there you get the ( ... | ... ) notation which provides you with parser alternatives. The weird ^^ thing is result conversions, so floatingPointNumber ^^ (_.toInt) means, take that floating number and return it as an integer using the wildcard _. repsep is a shorthand for interleaved repitition. Lastly the wavy dudes with ~, <~ and ~> denotes sequential composition and with the less and more than symbols it keeps to the left or right.

To run the parser and do something meaningful with it we read in a file pass it on to the parser and print the output

import java.io.FileReader 
 
object JSONTest extends JSON { 
        def main(args: Array[String]) { 
                val reader = new FileReader(args(0)) 
                println(parseAll(value, reader)) 
        } 
}

Different from Java, Scala does not have static types per se but allows you to do static like things by using a object instead of a class. Also notice how the FileReader is being imported directly from the Java class libraries.

After we’ve compiled the code and run it on the same file JSON file as Arnar provided in his example, we are presented with

[22.2] parsed: 
    Map(”glossary” -> 
        Map(”title” -> “example glossary”, “GlossDiv” -> 
            Map(”title” -> “S”, 
                “GlossList” -> 
                Map(”GlossEntry” -> 
                    Map(”Acronym” -> “SGML”, 
                        “Abbrev” -> “ISO 8879:1986″, 
                        “GlossSee” -> “markup”, 
                        “GlossTerm” -> “Standard Generalized Markup Language”, 
                        “ID” -> “SGML”, 
                        “GlossDef” -> 
                            Map(”para” -> 
                                “A meta-markup language, used to create markup 
                                languages such as DocBook.”, 
                                “GlossSeeAlso” -> List(”GML”, “XML”)), 
                        “SortAs” -> “SGML”)))))

I am to lazy to format this output nicely, but you get the picture. :) I noticed also that this JSON parser does not handle comments, which I leave as an exercise to my good readers.

I have been meaning to write something on Scala for quite some time now and now I was forced to since Arnar mentioned my crazy ranting. In the coming weeks I am hopefully going to show you some more niceness of Scala, such as implicit conversions, XML types, its functional nature, etc.

From Nand to Tetris

A classical problem with the way Computer Science is taught is that it does not connect the dots between the different layers of abstractions which build up modern computer applications. Now I am talking about everything from the hardware components up to multi layered software. People tend to focus on one of upper-most parts of the cake and get really good at that, without understanding what happens beneath. Some people are really good at writing code and solve complex domain problems without realizing why their application performs like a drunk mule. For example:

  • Doing some computationally intensive tasks, not understanding how CPU caches and memory architectures work.
  • Using fancy built in data structures in their favorite language, without knowing anything about the different implementations, like the difference between ArrayList vs. LinkedList.
  • Dive into multi-threading, thinking it is easy as one-two-three.
  • Thinking that by having garbage collection means no memory leaks.

The courses I have had in my education have tackled each of the layers in a computer very well, but have generally lacked a focus on what the bigger picture is and how things relate. Therefore I got really interested and excited when I saw this clip where Shimon Schocken from Efi Arazi School of Computer Science explains a course and book he came up with along with Noam Nisan. The point of the course is to have students learning about all the different abstractions by starting at building logic gates from nothing but Nand components and ending up with a computer game like Tetris. I highly recommend everyone to take a look at this video, and as a bonus, the entire thing can be done for free with the tools and exercises from their homepage.