This post is part of a series on Programming Language Design in Scala (PLDS). Click here to see the rest of the posts in the series.
In September 2014 I stumbled upon the Scala Parser Combinators library and ended up playing around with implementing a small programming language in Scala. Although the language itself was more or less useless, I thought that the process of designing and implementing it (and later extending it with more features) was a pretty fun activity. This then gave me the idea to start this blog as a place for programming and computer science related topics that I find interesting. My first blog post was supposed to be a short tutorial or introduction to Scala Parser Combinators based around the implementation of a small programming language. Because of a lack of motivation, ideas and time, my first post instead ended up being about an entirely different project, and my post about parser combinators remained an unfinished draft for more than six months.
Now I've finally found the energy to complete this project, or at least the first part of it. My idea is to write a series of blog posts about the design and implementation of programming languages using practical examples in Scala (and perhaps other languages in the future). The first two posts will be about the syntactic analysis part of a language implementation, i.e. parsing the source code. This very first post will briefly introduce language design, formal definition of syntax, and how to implement a scanner in Scala. A scanner (also known as a tokenizer or a lexer) is a simple program that reads a sequence of characters (the source code) and outputs a sequence of tokens, i.e. a sequence of syntactical components. This process is known as lexical analysis (and also sometimes tokenization), and is usually what precedes the actual parsing step, where a parse tree or an abstract syntax tree is constructed from the sequence of tokens. The parsing step is described in the next post, so for now we'll just be looking at the scanner.
This post should serve as an introduction to—and shouldn't require any prior knowledge of—language design and Scala, although a basic understanding of programming and programming languages is a prerequisite. I also recommend reading some tutorials if you're interested in learning more about Scala.
Language definition and grammar
The first step of any good language implementation is the specification. At this stage we formally specify the syntax of the language. The language we'll be implementing is a simple purely functional programming language similar in appearance to Haskell or ML and its dialects. I'll be referring to the language as PLDS (Programming Language Design in Scala), and—since I'll extend it in later posts—I'll refer to this iteration as PLDS/1.
Syntax
The formal syntax of PLDS can be expressed as a context-free grammar. A grammar consists of a list of production rules that specify how strings of a language may be formed. Each rule consists of a non-terminal on the left (i.e. the name of the rule) and any number of operators, terminals and non-terminals on the right. The following notational conventions (a variant of Extended Backus–Naur Form) are used:
A        a non-terminal, the name of a production rule
a        a non-terminal, the name of a token (defined in the next section)
"a"      a terminal: a keyword, operator or punctuation token
a b      concatenation, a followed by b
a | b    alternation, a or b
{a}      repetition, zero or more repetitions of a
[a]      option, zero or one a
(a)      group, used to override precedence
a - b    exception, a but not b
Using these conventions the overall structure of a PLDS program is defined as:
Program ::= {Definition}
Definition ::= "let" name "=" Expr
Thus a program is a sequence of definitions, and a single definition assigns the value of an expression to a name. Expressions are defined below:
Expr ::= "let" name "=" Expr "in" Expr
| "\\" name "->" Expr
| "if" Expr "then" Expr "else" Expr
| Comparison
Comparison ::= Comparison ("==" | "!=") AddSub
| AddSub
AddSub ::= AddSub ("+" | "-") MultDiv
| MultDiv
MultDiv ::= MultDiv ("*" | "/") Application
| Application
Application ::= Atomic {Atomic}
Atomic ::= "(" Expr ")"
| name
| literal
Expressions consist of let-expressions, λ-abstractions, if-expressions, comparisons, and arithmetic operations.
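To get a feel for what this grammar describes, here is a small example program (a purely illustrative, made-up snippet; we won't give it any meaning until a later post):
let double = \x -> x * 2
let answer = let a = 6 in double (a * 3 + 3)
let greet = \name -> if name == "" then "Hello, World!" else "Hello, " + name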
The above grammar defines the valid structure of a program in PLDS, and we will look more into it when we implement the parser. But for now we will move on to the lexical grammar, which is what we are interested in when implementing the scanner.
Lexical grammar
In the lexical grammar (or token grammar) we ignore the syntax/structure of a program, and instead just look at the individual tokens. The lexical grammar will look similar to the above context-free grammar, but with two additional conventions:
"a" | ... | "z"    a range of characters from a to z
any                any character
program ::= {skip | token}
token ::= keyword | operator | punctuation | name | literal
This means that, from the perspective of the lexical grammar, a program is a sequence of tokens and skippable characters, and a token is either a keyword, an operator, a punctuation character, an identifier, or a literal. Things that are skipped are defined as:
skip ::= {whitespace | comment}
whitespace ::= " " | "\t" | "\r" | "\n"
comment ::= "//" {any - "\n"}
| "/*" {any - "*/"} "*/"
Thus whitespace characters, such as spaces and newlines, between tokens, as well as C-style single- and multiline comments, are ignored.
keyword ::= "let" | "in" | "if" | "then" | "else"
operator ::= "->" | "==" | "!=" | "\\" | "+" | "-"
| "/" | "*" | "="
punctuation ::= "(" | ")"
The above rules define the reserved keywords, operators and punctuation characters of PLDS. Next, a name (identifier) is a letter followed by any number of letters, digits and underscores, as long as the result isn't a keyword:
name ::= letter {letter | digit | "_"} - keyword
letter ::= "a" | ... | "z"
| "A" | ... | "Z"
digit ::= "0" | ... | "9"
The last type of token is the literal, and in PLDS there are two types of literals: strings and numbers:
literal ::= number | string
number ::= digit {digit} ["." digit {digit}]
string ::= "\"" {char} "\""
char ::= any - ("\\" | "\"")
| "\\" any
One thing to note about this lexical grammar is that it is ambiguous, meaning that a string of input characters may be tokenized in more than one way. Consider for example the string "letabc15": Depending on how we look at it, we may interpret it as either three tokens (let (keyword), abc (name), and 15 (number)), two tokens (let (keyword) and abc15 (name)), or one token (letabc15 (name)). We'll therefore have to define an additional constraint to our lexical grammar: Reading from left to right, we accept the longest possible token, meaning that the example "letabc15" would indeed result in only one token: letabc15 (name).
Setting up the Scala project
The easiest way to get Scala up and running is probably to install an IDE with Scala support. There are several options available, e.g. Eclipse, IntelliJ IDEA, and NetBeans. As an alternative to a big IDE, you can also use your favourite text editor and run Scala from the command-line, see for instance this guide.
A quick way to get started is to use SBT, a build tool for Scala and Java projects (both Eclipse and IDEA have support for SBT projects). On Linux you should be able to find SBT in your favourite package manager, while on Windows you can use the installer. To create a new project using SBT: In an empty directory create a file named build.sbt with the following content (note: the blank lines are required):
name := "plds"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.3"
Then run the command sbt in the same directory. After running SBT the message [info] Set current project to plds should appear, followed by a > prompt. Type update and press enter. This will download the correct version of Scala and the parser combinators library.
SBT will look for Scala source files in both the root of the project directory and in the src/main/scala directory tree. For small projects you could probably just drop your source files in the root, but to keep things clean let's create the directory src/main/scala for our Scala source code. For now we'll only create one file called scan.scala, which will contain our scanner plus some additional classes. At the top of the file you can specify the name of the package following the usual Scala/Java naming conventions (i.e. reverse vendor domain followed by project name followed by optional subpackages):
package dk.nielssp.plds
We'll also need to import a number of classes:
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.{Position, Positional}
import scala.collection.immutable.HashSet
import scala.io.StdIn
Representing tokens
The output of our scanner will be a sequence of tokens, so before we continue, we will create a flat hierarchy of case classes to represent the different types of tokens in PLDS. In the file scan.scala add the following:
abstract sealed class Token extends Positional
case class StringToken(value: String) extends Token
case class NumberToken(value: Double) extends Token
case class NameToken(name: String) extends Token
case class KeywordToken(keyword: String) extends Token
case class PunctuationToken(punctuation: String) extends Token
case class OperatorToken(operator: String) extends Token
Case classes in Scala can be used similarly to tagged unions in other functional languages, and being sealed means that the Token class can only be extended by other classes defined in the same file. Each token type includes a value field as part of its constructor, meaning that for instance NumberToken(3.5) constructs a number token for the number 3.5. The Positional trait, which our Token class extends, is part of the parser combinators library and makes it possible to add positional information (line and column) to each token. This extra information makes error messages generated by the parser much more helpful to the programmer.
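Pattern matching is how we'll consume these tokens later on, both in the scanner and in the parser. As a quick illustration of how the case classes behave, here's a throwaway example (e.g. for the REPL; not part of scan.scala):
val tok: Token = NumberToken(3.5)
tok match {
  case NumberToken(value) => println(s"a number: $value")
  case NameToken(name) => println(s"a name: $name")
  case _ => println("some other token")
}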
Implementing the scanner
To implement the actual scanner we'll use the trait RegexParsers, which we imported from the parser combinators library earlier.
Traits in Scala are like interfaces in Java, but unlike interfaces, which only contain method signatures, traits can contain full method definitions. Thus the RegexParsers trait mentioned above contains a number of useful methods for implementing scanners and parsers. The parser combinators library contains other traits, such as the JavaTokenParsers trait specifically for parsing Java tokens, and the Parsers trait, which RegexParsers is a subclass of. RegexParsers extends Parsers with additional support for parsing strings of characters and using regular expressions, which is perfect for implementing a scanner. To use the trait we will create an object called scan:
object scan extends RegexParsers {
override def skipWhitespace = false
}
A singleton object in Scala is essentially a class with only one instance. We first override the method skipWhitespace with the value false, which means that the scanner won't automatically skip whitespace characters. We do this because we want to handle whitespace (and comments) explicitly with our own skip parser later on, and because automatic skipping would also discard spaces inside string literals, where they must be preserved.
Next we'll be translating the lexical grammar into Scala code, and because we are using a parser combinator library that allows us to combine parsers using familiar looking operators (such as |), this is an almost trivial task. In the following sections we'll use parser combinators to create methods for each of the rules in our lexical grammar, and each of those methods should be part of the scan object.
Literals
Let's start from the bottom by parsing literals. For instance, here's a parser that parses string characters:
def char: Parser[Char] = ("""[^"\\]""".r | '\\' ~> ".".r) ^^ { _.head }
The above code should look somewhat similar to the token grammar for the char token, but let's break it into smaller pieces to look at what's actually going on. The outermost operator is ^^, which is a combinator that takes the result of the parser on the left and applies a function to it. In the above example the type of the parser on the left is Parser[String], and we apply the anonymous function { _.head } to the result. If you are unfamiliar with Scala syntax, this may look a bit weird, but it is basically syntactic sugar for a lambda expression (x => x.head). Thus the function returns the first character of the input string, thereby converting the string parser on the left to a parser of type Parser[Char].
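For comparison, here is the exact same parser written with an explicit lambda instead of the placeholder syntax (an equivalent formulation, not a second definition to add):
def char: Parser[Char] = ("""[^"\\]""".r | '\\' ~> ".".r) ^^ (s => s.head)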
In the parser on the left ("""[^"\\]""".r | '\\' ~> ".".r) the outermost operator is |, since it has lower precedence than ~>. | is the alternation combinator, so it has the same basic meaning as | in the grammar, i.e. first try the parser on the left and if that fails try the parser on the right. The left side parser parses strings using a regular expression (a string is converted to a regular expression using the .r method). The regular expression is automatically converted to a parser because we are using the RegexParsers trait. The [^"\\] expression matches any single character that isn't a double quotation mark or a backslash.
On the other side of the | combinator we have '\\' ~> ".".r, which uses the ~> combinator. ~> is used for sequential composition of parsers (i.e. concatenation) but only keeps the result on the right, so the parser in question expects a backslash followed by any character (the . regular expression) but only outputs the second character.
To sum it all up we have a parser that parses any character except for double quotation marks and backslashes unless they are preceded by a backslash. We use this parser to parse strings:
def string: Parser[Token] =
"\"" ~> rep(char) <~ "\"" ^^ (chars => StringToken(chars.mkString))
In the above definition we see some familiar faces, namely ~> and ^^. The new operator <~ is also a combinator for sequential composition, but unlike ~> it keeps the result of the left side parser. The rep function is the combinator for repetition, so rep(char) accepts zero or more characters. Thus the string parser accepts a double quotation mark followed by zero or more characters followed by a double quotation mark, and it throws away the quotation marks. The result of the rep(char) combinator is of type List[Char], so we use the ^^ combinator to first create a string from the characters using mkString, and then create the StringToken object with the resulting string.
Next we can parse number literals using a simple regular expression:
def number: Parser[Token] =
"""\d+(\.\d+)?""".r ^^ (s => NumberToken(s.toDouble))
Here we convert the resulting string to a double using the toDouble method before creating the number token.
Operators, punctuation, names and keywords
Similarly operators and punctuation characters are parsed by regular expressions:
def operator: Parser[Token] = """->|==|!=|\\|\+|-|/|\*|=""".r ^^ OperatorToken
def punctuation: Parser[Token] = """\(|\)""".r ^^ PunctuationToken
Since we don't need to do any additional modifications to the result of the regular expression, we can simply use the name of the token class as the right side of the ^^ combinator in order to create the tokens.
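This works because the companion object of a case class with a single constructor parameter can itself be used as a function. The operator parser above is therefore equivalent to this more explicit version (shown only for comparison, not to be added):
def operator: Parser[Token] = """->|==|!=|\\|\+|-|/|\*|=""".r ^^ (s => OperatorToken(s))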
For names and keywords we're going to simplify the implementation a bit. Because a keyword consists of only letters, we can implement our name parser such that it checks whether the parsed name is a keyword. We do this by first defining a hash set of the keywords in PLDS:
val keywords = HashSet("let", "in", "if", "then", "else")
def name: Parser[Token] = """[a-zA-Z][a-zA-Z0-9_]*""".r ^^ {
case s if keywords.contains(s) => KeywordToken(s)
case s => NameToken(s)
}
This way we have also solved the ambiguity of parsing a string such as "letlet": Since the regular expression will match the entire string and "letlet" isn't part of the hash set, a NameToken("letlet") will be returned instead of two keywords.
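If you want to try the individual parsers before the full scanner is done, the parse method inherited from RegexParsers applies a single parser to a string. A throwaway test object along these lines (hypothetical, and safe to delete again) shows both the literal parsers and the keyword check in action:
object quickCheck extends App {
  // each line should print a Success wrapping the expected token
  println(scan.parse(scan.number, "3.25"))     // NumberToken(3.25)
  println(scan.parse(scan.string, "\"hi\""))   // StringToken(hi)
  println(scan.parse(scan.name, "let"))        // KeywordToken(let)
  println(scan.parse(scan.name, "letabc15"))   // NameToken(letabc15)
}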
Comments and whitespace
The parsers for single line and multiline comments are implemented below:
def singleComment: Parser[Unit] = "//" ~ rep(not("\n") ~ ".".r) ^^^ Unit
def multiComment: Parser[Unit] = "/*" ~ rep(not("*/") ~ "(?s).".r) ~ "*/" ^^^ Unit
Here we introduce the ~, not, and ^^^ combinators. The not combinator is used to negate the result of a parser while not consuming any input, e.g. not("\n") fails if the next character is a line break, but it does not consume that line break. The ^^^ combinator is similar to the previously introduced ^^ combinator, except that it doesn't care about the result of the parser on the left and merely replaces the output with the value on the right. In this case we don't care about the content of the comments, so we return Unit, i.e. nothing or "void". ~ is yet another combinator for sequential composition but, unlike the others, it keeps the results of both the left side and the right side. In this case it doesn't make any difference, since we replace the final result with Unit.
We combine the comment parsers with a whitespace parser to implement the skip parser:
def skip: Parser[Unit] = rep(whiteSpace | comment) ^^^ Unit
def comment: Parser[Unit] = singleComment | multiComment
whiteSpace is a parser provided by RegexParsers that parses whitespace characters.
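For reference, whiteSpace is essentially defined in RegexParsers as a regular expression matching one or more whitespace characters, something along these lines:
protected val whiteSpace = """\s+""".r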
Tokens and program structure
To parse a token we simply combine the token parsers we have previously implemented:
def token: Parser[Token] =
positioned(operator | punctuation | name | number | string)
We surround the entire thing with positioned(), which inserts positional information into each token (this is why our Token class extends the Positional trait).
Recall that the scanner views the program as a sequence of tokens (and the token grammar defines a program as {skip | token}). There are however two issues we have to address when implementing this: First, we can't just do rep(skip | token), since skip and token are not of the same type (skip has type Parser[Unit] and token has type Parser[Token]). Instead we do skip ~> rep(token <~ skip), which essentially (because skip also accepts empty strings) does the same thing while returning a list of tokens. Second, the scanner must fail in a meaningful way if it encounters a character it doesn't understand. We implement this by adding an end-of-file parser with a custom failure message:
def program: Parser[List[Token]] = skip ~> rep(token <~ skip) <~ eof
def eof: Parser[String] = "\\z".r | failure("unexpected character")
failure() is a parser that always fails, so when combined with another parser using the | combinator, we can set a custom failure message for when the other parser fails. In this case we use the regular expression "\z" to match the end of the file. As a result the program parser will fail if it encounters something that isn't a valid token, whitespace, a comment or the end of the file.
Applying the scanner
The final method of the scanner takes an input string and applies the program parser to it:
def apply(input: String): List[Token] = parseAll(program, input) match {
case Success(result, _) => result
case NoSuccess(msg, next) => throw SyntaxError(msg, next.pos)
}
We use the parseAll function to apply the parser to the input, and then we use pattern matching on the result (of type ParseResult[List[Token]]). If the parser was successful we return the resulting list of tokens. Otherwise we throw the exception implemented below (place it somewhere outside of the scan object):
case class SyntaxError(msg: String, pos: Position) extends Exception(msg)
Aside from the error message it also includes the position of the next character, which is useful to a programmer trying to find an error among hundreds of lines of code.
The apply method has special meaning in Scala in that it allows us to call the scan object as if it were a function. For instance, in order to generate a list of tokens from the string "let a = 42" we simply call scan("let a = 42").
Testing it
To test our scanner we'll implement a simple RSPL (a read–scan–print loop, similar to a REPL). It will accept lines of code from the standard input, apply the scanner, and then print the resulting list of tokens. Simply insert the following code at the bottom of scan.scala:
object rspl {
  def main(args: Array[String]) {
    while (true) {
      val input = StdIn.readLine("> ")
      try {
        if (input != null) {
          val tokens = scan(input)
          println(tokens)
        }
      }
      catch {
        case e: SyntaxError =>
          println(s"Syntax error: ${e.msg} on line ${e.pos.line}, column ${e.pos.column}")
          if (e.pos.line > 0) {
            println(input.lines.toList(e.pos.line - 1))
            println("-" * (e.pos.column - 1) + "^")
          }
      }
    }
  }
}
To run it simply type run in the SBT console (or right-click on the main method and choose "Run" if you're using an IDE).
If everything went well, you should be presented with another > prompt in which you can type in lines of code for the scanner to scan:
> let a = 42
List(KeywordToken(let), NameToken(a), OperatorToken(=), NumberToken(42.0))
We see that the scanner correctly identifies a keyword, a name, an operator, and a number. Since the scanner doesn't concern itself with the syntax, we can type in any combination of valid tokens:
> letlet("string"3.0
List(NameToken(letlet), PunctuationToken((), StringToken(string), NumberToken(3.0))
Comments and whitespace are ignored and won't appear in the token list:
> 24 + /* comment */ 12 // another comment
List(NumberToken(24.0), OperatorToken(+), NumberToken(12.0))
The scanner fails if it encounters unknown characters (outside of a string literal):
> a + b ^ 2
Syntax error: unexpected character on line 1, column 7
a + b ^ 2
------^
It will also fail if we forget to end a string literal:
> let s = "Hello, World!
Syntax error: `"' expected but end of source found on line 1, column 23
let s = "Hello, World!
----------------------^
Conclusion
The scanner is the first step towards implementing a programming language. In fact, a scanner is often all you need if you just want to implement simple syntax highlighting for a programming language.
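As a purely hypothetical sketch of that idea, a syntax highlighter could map each token type to a style or CSS class (the class names below are made up):
def highlightClass(token: Token): String = token match {
  case _: KeywordToken => "keyword"
  case _: StringToken => "string"
  case _: NumberToken => "number"
  case _: OperatorToken => "operator"
  case _: PunctuationToken => "punctuation"
  case _: NameToken => "name"
}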
Since the scanner is where we've implemented the token grammar of our programming language, it is also where the keywords, literals, special characters, etc. of the language are defined. Thus, if we later wish to, for instance, add a new keyword to the language, we'll have to modify the scanner. A possible improvement to the lexical grammar could be the addition of more numeric literals, such as hexadecimal literals (e.g. 0xF30A) or optional exponents (e.g. 0.5e7). Another possible improvement could be the addition of more escape sequences to string literals, such as "\n" for line breaks or "\u00C6" for Unicode characters.
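To give an idea of what those extensions might look like, here are rough sketches of an extended number parser and an extended char parser (drafts for illustration only, not part of PLDS/1; the hexadecimal conversion via java.lang.Long.parseLong is just one possible approach):
def number: Parser[Token] =
  // the hexadecimal alternative must come first, otherwise "0xF30A" would scan as the number 0
  """0[xX][0-9a-fA-F]+""".r ^^ (s => NumberToken(java.lang.Long.parseLong(s.drop(2), 16).toDouble)) |
  """\d+(\.\d+)?([eE][+-]?\d+)?""".r ^^ (s => NumberToken(s.toDouble))
def char: Parser[Char] =
  """[^"\\]""".r ^^ { _.head } |
  '\\' ~> ".".r ^^ {
    case "n" => '\n' // "\n" becomes a line break
    case "t" => '\t' // "\t" becomes a tab
    case other => other.head // any other escaped character stands for itself
  }
A "\u00C6"-style Unicode escape could be added as a third alternative in the char parser in a similar way.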
Anyway, this is the end of part 1 of what will hopefully be a series on Programming Language Design in Scala (and perhaps it will branch off into other languages in the future). The post ended up being quite a bit longer than I had originally wanted, and I am not entirely sure whether I managed to make the topic easy to understand.