Parsing simple XML-like, structural languages with Xtext is a no-brainer. However, parsing nested expressions is often considered a bit more complicated. This is due to their recursive nature and also because with Xtext you have to avoid left-recursive parser rules. As the underlying parser (generated by Antlr) uses a top-down approach, it would recurse endlessly if you had a left-recursive grammar.

Let’s have a look at parsing a simple arithmetic expression:

`2 + 20 * 2`

If you know EBNF a bit and wouldn’t think about avoiding left recursion, operator precedence or associativity, you’ld probably write a grammar like this:

1 2 3 4 |
Expression : Expression '+' Expression | Expression '-' Expression | INT; |

This grammar would be left-recursive because the parser reads the grammar top-down and left to right and would endlessly call the Expression rule without consuming any characters, i.e. altering the underlying state of the parser. While this kind of grammars can be written for bottom-up parsers, you’ld still have to deal with operator precedence in addition. That is define that a multiplication has higher precedence than an addition for example.

In Xtext you define the precedence implicitly when left-factoring such a grammar. Left-factoring means you get rid of left recursion by applying a certain technique, which I will show in the following.

So here is a left-factored grammar (not yet working with Xtext) for the expression language above :

1 2 3 4 5 6 7 8 |
Addition : Multiplication ('+' Multiplication)*; Multiplication: NumberLiteral ('*' NumberLiteral)*; NumberLiteral: INT; |

As you can see the main difference is that we have three rules instead of one, and if you look a bit closer you see that there’s a certain delegation pattern involved. The rule Addition doesn’t call itself but calls Multiplication instead. The operator precedence is defined by the order of delegation. The later the rule is called the higher is its precedence. This is at least the case for the first two rules which are of a left-recursive nature (but we’ve left-factored them now). The last rule is not left-recursive which is why you can write them down without applying this pattern.

We should allow users to explicitly adjust precedence by adding parentheses, e.g. write something like `(2 + 20) * 2`

. So let’s add support for that (note that the grammar is still not working with Xtext):

1 2 3 4 5 6 7 8 9 10 11 12 |
Addition : Multiplication ('+' Multiplication)*; Multiplication: Primary ('*' Primary)*; Primary : NumberLiteral | '(' Addition ')'; NumberLiteral: INT; |

Once again: if you have some construct that recurses on the left hand side, you need to put it into the delegation chain according to their operator precedence. The pattern is always the same, the thing that recurses delegates to the rule with the next higher precedence.

### Construction of an AST

Now that we know how to avoid left-recursion, let’s have a look at what the parser produces. In Xtext each rule returns some value:

- Parser rules return AST nodes (i.e. EObject instances),
- enum rules return enum literals and
- datatype rules as well as
- terminal rules return simple values like strings and the like (
*EDatatype*in EMF jargon).

Xtext can automatically infer whether some rule is a parser rule, i.e. constructs and returns an AST node, or if it is a datatype rule. The grammars above only consisted of datatype rules so all they would produce is a string.

In order to construct an AST we need to add Assignments and Actions. But before we do that we need to talk about return types.

The return type of a rule can be specified explicitly using the ‘*returns*‘ keyword but can be inferred if the type’s name is the same as the rule’s name. That is

1 |
NumberLiteral : ... ; |

is a short form of

1 |
NumberLiteral returns NumberLiteral : ... ; |

However, in the case of the expressions grammar above, the rules all need to return the same type since they are recursive. So in order to make the grammar functional we need to add a common return type explicitly (but the grammar is still missing some bits):

1 2 3 4 5 6 7 8 9 10 11 12 |
Addition returns Expression: Multiplication ('+' Multiplication)*; Multiplication returns Expression: Primary ('*' Primary)*; Primary returns Expression: NumberLiteral | '(' Addition ')'; NumberLiteral: INT; |

The AST type inference mechanism of Xtext will infer two types: Expression and NumberLiteral. Now we need to add assignments and Actions in order to store all the important information in the AST and to create reasonable subtypes for the two operations.

In the following you see the final fully working Xtext grammar:

1 2 3 4 5 6 7 8 9 10 11 12 |
Addition returns Expression: Multiplication ({Addition.left=current} '+' right=Multiplication)*; Multiplication returns Expression: Primary ({Multiplication.left=current} '*' right=Primary)*; Primary returns Expression: NumberLiteral | '(' Addition ')'; NumberLiteral: value=INT; |

Let’s go through the grammar as the parser would do it for the expression

`( 1 + 20 ) * 2`

*(I’m sure it’s pretty hard to follow what’s going just by reading this text. Therefore I have prepared a small interactive slide show. You can find it at the end of this section.)*

The parser always starts with the first rule (*Addition*). Therein the first element is an unassigned rule call to *Multiplication* which in turn calls *Primary*. *Primary* now has two alternatives. The first one calls *NumberLiteral* which consists only of one assignment to a feature called ‘value’ of what the *INT* rule returns.

But as the first token in the expression is an opening parenthesis ‘*(*‘, the parser will take the second alternative in *Primary*, consume the ‘(‘ and call the rule *Addition*. Now the value ‘1’ is the look-ahead token and again *Addition* calls *Multiplication *and *Multiplication* calls *Primary*. This time the parser takes the first alternative because ‘1’ has been consumed by the *INT* rule (which btw. is a reused library terminal rule).

As soon as the parser hits an assignment it checks whether an AST node for the current rule has already been created. If not it will create one based on the return type, which is *NumberLiteral*. The Xtext generator will have created an EClass ‘NumberLiteral’ before, which can now be instantiated. That type will also have a property called *value* of type Integer, which will get the value ‘1’ set. This is what the Java equivalent would look like:

1 2 3 4 5 6 |
// value=INT if (current == null) { current = new NumberLiteral(); } current.setValue(ruleINT()); ... |

Now that the rule has been completed, the created EObject is returned to the calling rule *Primary*, which in turn returns the object unchanged to its own caller. Within *Multiplication* the call to *Primary* has been successfully parsed and returns an instance of *NumberLiteral*. The second part of the rule is a so called g*roup* (everything within the parenthesis). The asterisk behind the closing parenthesis states that this part can be consumed zero or more times. The first token to consume in this part is the multiplication operator ‘*’. Unfortunately in the current situation the next token to consume is the plus operator ‘+’, so the group is not consumed at all and the rule returns the result of the unassigned rule call (the *NumberLiteral*) .

In rule *Addition* there’s a similar group, but this time it expects the correct operator so the parser goes into the group. The first element in the group is a so called a*ction*. As Xtext grammars are highly declarative and bi-directional, it is not a good idea to allow arbitrary expression within actions as it is usually the case with other parser generators. Instead we only support two kinds of actions. This one will create a new instance of type *Addition* and assign what was the to be returned object to the feature *left*. In Java this would have been something like:

1 2 3 4 5 6 7 |
// Multiplication rule call current = ruleMultiplication(); // {Addition.left=current} Addition temp = new Addition(); temp.setLeft(current); current = temp; ... |

As a result the rule would now return an instance of *Addition* which has a *NumberLiteral* set to its property *left*. Next up the parser consumes the ‘+’ operator. We do not store the operator in the AST because we have an explicit *Addition* type, which implicitly contains this information. The assignment (`right=Multiplication`

) calls *Multiplication* another time and assigns the returned object (a *NumberLiteral* of value=20) to the property named *right*.

If we now had an additional plus operation ‘+’ (e.g. `1 + 2 + 3`

) the group would match another time and create another instance of *Addition*. But we don’t and therefore the rule is completed and returns the created instance of *Addition* to its caller which was the second alternative in *Primary*. Now the closing parenthesis is matched and consumed and the stack is reduced once more.

We are now in rule *Multiplication* and have the multiplication operator ‘*’ on the look ahead. The parser goes into the group and applies the action. Finally it calls the Primary rule, gets another instance of *NumberLiteral* (value=2), assigns it as the ‘right’ operand of the *Multiplication,* and returns the *Multiplication* to *Addition* which in turn returns the very same object as there’s nothing left to parse.

The resulting AST looks like this:

The slide show below illustrates how the parsing process goes.

### Associativity

There is still one topic I should mention, which is associativity. There is left and right associativity as well as no associativity. In the example we have seen left associativity. Associativity tells the parser how to construct the AST when there are two infix operations with the same precedence. The following example is taken from the corresponding wikipedia entry:

Consider the expression `a ~ b ~ c`

. If the operator ~ has left associativity, this expression would be interpreted as `(a ~ b) ~ c`

and evaluated left-to-right. If the operator has right associativity, the expression would be interpreted as `a ~ (b ~ c)`

and evaluated right-to-left. If the operator is non-associative, the expression might be a syntax error, or it might have some special meaning.

We already know the most important form which is left associativity:

1 2 |
Addition returns Expression: Multiplication ({Addition.left=current} '+' right=Multiplication)*; |

Right associativity is done using the following pattern (note the quantity operator and the call to the rule itself at the end):

1 2 |
Addition returns Expression: Multiplication ({Addition.left=current} '+' right=Addition)?; |

And if you don’t want to allow multiple usages of the same expression in a row (hence non-associativity) you write:

1 2 |
Addition returns Expression: Multiplication ({Addition.left=current} '+' right=Multiplication)?; |

Note that sometimes it’s better to allow associativity on parser level, but forbid it later using validation, because you can come up with a better error message. Furthermore the whole parsing process won’t be interrupted, so your tooling will generally be more forgiving.

FelipeApril 26, 2016 at 04:06Great!

SamFebruary 4, 2017 at 16:23Excellent article, really helped me understand!