a little madness

ANTLR By Example: Part 3: Parsing

Introduction

We finished Part 2: Lexical Analysis with a working lexer. The lexer splits character streams into streams of tokens such as “changed.by.me”, “(” and “not”. In this part, we will create a parser that can turn these token streams into an abstract syntax tree (AST).

Abstract Syntax Trees (ASTs)

An AST captures the structure of the expression being parsed. Expressions are recursive by nature: they are made up of one or more sub-expressions, which in turn have sub-expressions of their own. Such a structure is naturally modelled as a tree, where each sub-tree models a sub-expression. The tree also captures the precedence of operators: a higher-precedence operator will bind more tightly and become a lower node in the tree. For example, the expression “changed or failed and not state.change” may be represented in tree form:

    changed
or
        failed
    and
        not
            state.change
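
To make the shape of such a tree concrete, here is a minimal Java sketch that builds the example tree by hand and prints it with the root first, followed by its children. The ExprNode and TreeSketch classes are hypothetical names invented for this illustration; ANTLR supplies its own AST classes, which are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration only; not an ANTLR class.
final class ExprNode {
    final String text;
    final List<ExprNode> children = new ArrayList<>();

    ExprNode(String text, ExprNode... kids) {
        this.text = text;
        for (ExprNode kid : kids) {
            children.add(kid);
        }
    }

    // Render as "( root child1 child2 )"; leaves print as bare text.
    String render() {
        if (children.isEmpty()) {
            return text;
        }
        StringBuilder sb = new StringBuilder("( ").append(text);
        for (ExprNode child : children) {
            sb.append(' ').append(child.render());
        }
        return sb.append(" )").toString();
    }
}

final class TreeSketch {
    // Builds the tree for "changed or failed and not state.change".
    // "not" binds most tightly, so it sits lowest; "or" binds least
    // tightly, so it becomes the root.
    static ExprNode example() {
        ExprNode not = new ExprNode("not", new ExprNode("state.change"));
        ExprNode and = new ExprNode("and", new ExprNode("failed"), not);
        return new ExprNode("or", new ExprNode("changed"), and);
    }

    public static void main(String[] args) {
        // prints: ( or changed ( and failed ( not state.change ) ) )
        System.out.println(example().render());
    }
}
```

Each sub-tree is a complete sub-expression in its own right, which is what makes the structure easy to walk recursively.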

Parsers

Defining the Parser

ANTLR parsers are defined in grammar files, usually in the same file as the associated lexer. From the ANTLR manual:

{ optional class code preamble }
"class" YourParserClass "extends Parser;"
options
tokens
{ optional action for instance vars/methods }
parser rules...

This is very similar to the syntax for lexers that we saw in Part 2. We will be focusing mostly on the parser rules, which define the sequences of tokens that the parser will accept, and how the AST should be constructed. Parser rules have the same high-level format as lexer rules:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Parser rule names must begin with a lower case letter. Alternatives are expressed in terms of tokens (as opposed to characters for lexers). You can also use string literals – they are handled specially by ANTLR (the strings are placed in a literals table for the associated lexer, and literals are treated as tokens).

To parse our boolean expression language, we need to break down the language into ANTLR rules. When breaking the language down, we also encode the precedence of the operators. Referring back to the example tree above, you will recall that the higher precedence operators appear lower in the tree. So, for example, we can see that the general form of an “or” expression is:

orexpression
    :   andexpression ("or" andexpression)*
    ;

Note the use of the ‘*’ operator to indicate zero or more repetitions. Other repetition operators, such as ‘+’ (one or more) and ‘?’ (zero or one), may also be used. Working our way from top to bottom, we can define the whole language in this way:

orexpression
    :   andexpression ("or" andexpression)*
    ;
andexpression
    : notexpression ("and" notexpression)*
    ;
notexpression
    : ("not")? atom
    ;
atom
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Each lower-precedence operator has its expression defined in terms of the next-highest precedence operator. This language is missing something, however: it does not yet handle grouping with parentheses “(…)”. Recall that the purpose of grouping is to override the normal precedence. A grouped expression is therefore an indivisible element (an atom) with regard to the surrounding operators. So, to introduce grouped expressions, we need to define an atom as either a primitive condition or a grouped expression:

atom
    : condition
    | LEFT_PAREN orexpression RIGHT_PAREN
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;
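
One way to see how these rules encode precedence and grouping is to write the same grammar by hand, one method per rule. The sketch below is only an illustration under stated assumptions: it is not the code ANTLR generates, the RecognizerSketch class is a made-up name, and the tokenizer is a crude stand-in for the Part 2 lexer that pads parentheses and splits on whitespace. Unlike the grammar above, the sketch also insists that all of the input is consumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hand-written sketch; ANTLR generates its own (different) parser code.
final class RecognizerSketch {
    private static final List<String> CONDITIONS = Arrays.asList(
            "true", "false", "success", "failure", "error",
            "changed", "changed.by.me", "state.change");

    private final List<String> tokens;
    private int pos = 0;

    private RecognizerSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // Stand-in for the Part 2 lexer: pad parentheses, split on whitespace.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        String padded = input.replace("(", " ( ").replace(")", " ) ");
        for (String t : padded.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True if the input matches the grammar and no tokens are left over.
    static boolean accepts(String input) {
        RecognizerSketch r = new RecognizerSketch(tokenize(input));
        try {
            r.orexpression();
            return r.pos == r.tokens.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // Consume the next token only if it equals the expected text.
    private boolean match(String expected) {
        if (expected.equals(peek())) {
            pos++;
            return true;
        }
        return false;
    }

    // orexpression : andexpression ("or" andexpression)* ;
    private void orexpression() {
        andexpression();
        while (match("or")) {
            andexpression();
        }
    }

    // andexpression : notexpression ("and" notexpression)* ;
    private void andexpression() {
        notexpression();
        while (match("and")) {
            notexpression();
        }
    }

    // notexpression : ("not")? atom ;
    private void notexpression() {
        match("not"); // optional, so the result is ignored
        atom();
    }

    // atom : condition | LEFT_PAREN orexpression RIGHT_PAREN ;
    private void atom() {
        if (match("(")) {
            orexpression();
            if (!match(")")) {
                throw new IllegalStateException("expected )");
            }
        } else if (CONDITIONS.contains(peek())) {
            pos++;
        } else {
            throw new IllegalStateException("unexpected token: " + peek());
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("changed.by.me or not (success and changed)"));
        System.out.println(accepts("success or or changed"));
    }
}
```

Because atom calls back into orexpression, a parenthesised group can hold any expression at all, yet the surrounding operators see it as a single unit.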

Building the AST

Now we have a parser that will accept valid token streams for our language. The next step is to tell ANTLR to construct the AST, and to annotate the rules to describe how the tree should be built. Turning on AST generation is done with an option:

options {
    buildAST=true;
}

We guide the AST construction using postfix annotations on the tokens in our parser rules. The following annotations are available:

  • no annotation: a token without an annotation becomes a leaf node in the tree
  • ^: a token annotated with a caret becomes a sub-expression root
  • !: a token annotated with an exclamation point is not included in the tree

In our language, the operators form root nodes, so we will annotate them with the ‘^’ character. As the grouping operators are only used to override precedence, there is no need for them in the final AST, so we will omit them by using the ‘!’ annotation. This gives us our final parser definition:

class NotifyConditionParser extends Parser;
options {
    buildAST=true;
}
orexpression
    :   andexpression ("or"^ andexpression)*
    ;
andexpression
    : notexpression ("and"^ notexpression)*
    ;
notexpression
    : ("not"^)? atom
    ;
atom
    : condition
    | LEFT_PAREN! orexpression RIGHT_PAREN!
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;
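
To illustrate what the ‘^’ and ‘!’ annotations ask for, here is a hand-written Java sketch that builds a tree while following the same rules. It is an approximation for illustration, not ANTLR's generated code: AstSketch and Node are hypothetical names, the tokenizer is a whitespace-splitting stand-in for the Part 2 lexer, and the sketch assumes that a repeated operator makes the tree built so far its first child.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hand-written sketch of tree construction; not ANTLR's generated code.
final class AstSketch {
    // Hypothetical node type: operator or condition text plus children.
    static final class Node {
        final String text;
        final List<Node> children;

        Node(String text, Node... children) {
            this.text = text;
            this.children = Arrays.asList(children);
        }

        // Render as "( root child ... )"; leaves print as bare text.
        String render() {
            if (children.isEmpty()) {
                return text;
            }
            StringBuilder sb = new StringBuilder("( ").append(text);
            for (Node c : children) {
                sb.append(' ').append(c.render());
            }
            return sb.append(" )").toString();
        }
    }

    private static final List<String> CONDITIONS = Arrays.asList(
            "true", "false", "success", "failure", "error",
            "changed", "changed.by.me", "state.change");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    // Stand-in for the Part 2 lexer: pad parentheses, split on whitespace.
    private AstSketch(String input) {
        String padded = input.replace("(", " ( ").replace(")", " ) ");
        for (String t : padded.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
    }

    // Parses the input and returns the rendered tree.
    static String parse(String input) {
        AstSketch p = new AstSketch(input);
        Node root = p.orexpression();
        if (p.pos != p.tokens.size()) {
            throw new IllegalStateException("unexpected token: " + p.peek());
        }
        return root.render();
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    private boolean match(String expected) {
        if (expected.equals(peek())) {
            pos++;
            return true;
        }
        return false;
    }

    // "or"^ : the operator becomes the root of what was built so far.
    private Node orexpression() {
        Node left = andexpression();
        while (match("or")) {
            left = new Node("or", left, andexpression());
        }
        return left;
    }

    // "and"^ : same pattern, one precedence level down.
    private Node andexpression() {
        Node left = notexpression();
        while (match("and")) {
            left = new Node("and", left, notexpression());
        }
        return left;
    }

    // ("not"^)? : when present, "not" is the root of its single operand.
    private Node notexpression() {
        if (match("not")) {
            return new Node("not", atom());
        }
        return atom();
    }

    // LEFT_PAREN! and RIGHT_PAREN! : matched, but left out of the tree.
    private Node atom() {
        if (match("(")) {
            Node inner = orexpression();
            if (!match(")")) {
                throw new IllegalStateException("expected )");
            }
            return inner;
        }
        String t = peek();
        if (CONDITIONS.contains(t)) {
            pos++;
            return new Node(t); // unannotated token: a leaf
        }
        throw new IllegalStateException("unexpected token: " + t);
    }

    public static void main(String[] args) {
        System.out.println(parse("changed.by.me or not (success and changed)"));
    }
}
```

The parentheses affect only which sub-tree ends up under “not”; they leave no node of their own behind.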

Generating the Parser

At last, we are ready to generate the parser code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

Now we get six files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionParser.java
NotifyConditionParser.smap
NotifyConditionParserTokenTypes.java
NotifyConditionParserTokenTypes.txt

Let’s take the parser for a test drive:

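import java.io.Reader;
import java.io.StringReader;

import antlr.RecognitionException;
import antlr.TokenStreamException;
import antlr.collections.AST;

// ParserTest is a placeholder class name for this test drive.
public class ParserTest
{
    public static void main(String[] argv)
    {
        // The expression to parse is passed as the first command-line argument.
        Reader reader = new StringReader(argv[0]);
        NotifyConditionLexer lexer = new NotifyConditionLexer(reader);
        NotifyConditionParser parser = new NotifyConditionParser(lexer);

        try
        {
            // Parse from the top-level rule, then print the resulting tree.
            parser.orexpression();
            AST tree = parser.getAST();
            System.out.println(tree.toStringTree());
        }
        catch (RecognitionException e)
        {
            // The token stream does not match the grammar.
            System.err.println(e.toString());
        }
        catch (TokenStreamException e)
        {
            // The lexer could not produce a valid token.
            System.err.println(e.toString());
        }
    }
}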

All of the rules in our parser become methods. To parse an expression, we simply call the method for the highest-level rule: orexpression. The trees are printed using a Lisp-like syntax. For example, the output when parsing the expression “changed.by.me or not (success and changed)” is:

( or changed.by.me ( not ( and success changed ) ) )

The tree nodes are shown in parentheses with the root first, followed by a list of children. We can also try an expression with valid tokens but an invalid structure, such as “success or or changed”:

line 1:12: unexpected token: or

The parser rejects the expression because there is no rule that will accept “or or”.

Wrap Up

We’re really getting there now. We have a fully functional parser that accepts valid expressions and produces ASTs to represent them. In the next part, we will see how to convert these ASTs into our own notification condition data structure using ANTLR tree parsers.


2 Responses to “ANTLR By Example: Part 3: Parsing”

  1. December 13th, 2006 at 9:47 pm

    Bruno Vernay says:

    Very instructive!
    I just found that the parser rule names orexpression, andexpression and notexpression are misleading.
    I would have understood more quickly if they were named precedence1, precedence2, precedence3 or so.

    Thanks
    Bruno

  2. April 13th, 2011 at 10:43 pm

    Kenny says:

    How is the expression x—y parsed? Is it a legal expression?
