Jun 22nd 2023

Code generation for Langium-based DSLs (2/3)

Christian Schneider

Hi everybody. This is the second part of my series on code generation for Langium-based DSLs. Last time I introduced the running example that I’m using throughout this series and discussed some issues with plain JavaScript template literals. I presented Langium’s tag function expandToString as our Solution A for tackling these issues. This function notably increases the utility of JavaScript templates. However, it still has some limitations. Today, I’m going to present our Solution B, which provides more flexibility and power in various regards.

Imagine you want to skip line breaks after certain lines if no content has been added to them. Imagine you need to make the indentation of your code snippets configurable, or you need to post-process and adjust your generated code to meet certain criteria. Think about adding import clauses whenever references to symbols are added to the code, as is needed for target languages like Java or JavaScript. Generating rich expression syntaxes may also demand more abstraction than plain strings offer. Last but not least, we might want to associate the generated snippets with the source definitions they represent, in terms of their regions within the text. Such requirements demand a different approach.

In this part I’ll focus on the two-stage code generation approach and show a way of integrating it with the tagged templates I used in part one.

Generation tree

One way to address the requirements outlined above is to split the generation task in two, and to capture the intermediate result in a data structure that is more expressive than a string. Task 1 builds up a description of the code to be generated; task 2 then renders the desired output.

In our daily practice a tree-like data structure has proven very beneficial. We’ve defined the following union of data types and refer to that as type Generated:

type Generated = string | GeneratorNode | undefined;
type GeneratorNode = CompositeGeneratorNode | IndentNode | NewLineNode;

The result of generation task 1 may be of type string already, e.g., if the result is very short. Normally, it will be of type GeneratorNode. In addition, it may be undefined. That doesn’t make much sense at the top level, but it is very beneficial when moving parts of the template to subroutines. A result of undefined allows such functions to signal to their caller that nothing is to be generated, which I consider different from, e.g., an empty string. I’ll elaborate on that below.

CompositeGeneratorNode implements the Composite design pattern. Instances of this type are containers and may accommodate a series of further strings and generator nodes. IndentNode is a specialization of CompositeGeneratorNode contributing indentation information. Instances of NewLineNode describe line breaks; they’re parameterizable in terms of their strictness.
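To make the two tasks more tangible, here is a minimal, self-contained sketch of such a tree and a renderer. The classes CompositeNode and NewLine and the functions render and describeFunction are simplified stand-ins invented for this illustration; they are not Langium's implementations and ignore many details of the real ones.

```typescript
// Toy model of a generation tree. These classes are simplified stand-ins
// written for this example, not Langium's actual implementations.
type Generated = string | GeneratorNode | undefined;
type GeneratorNode = CompositeNode | NewLine;

class NewLine {
    ifNotEmpty: boolean; // true: render a line break only if the current line has content
    constructor(ifNotEmpty = false) { this.ifNotEmpty = ifNotEmpty; }
}

class CompositeNode {
    children: Array<string | GeneratorNode> = [];
    indent: boolean; // true models the role of an IndentNode
    constructor(indent = false) { this.indent = indent; }
    append(...parts: Generated[]): this {
        for (const part of parts) {
            if (part !== undefined) this.children.push(part); // undefined contributes nothing
        }
        return this;
    }
}

// Task 2: render the description into a string.
function render(root: Generated, indentation = '    '): string {
    let out = '';
    let line = ''; // content of the line currently being built
    const visit = (node: Generated, depth: number): void => {
        if (node === undefined) {
            return;
        } else if (typeof node === 'string') {
            // The concrete indentation is decided here, at rendering time.
            if (line.length === 0 && node.length > 0) line = indentation.repeat(depth);
            line += node;
        } else if (node instanceof NewLine) {
            if (!node.ifNotEmpty || line.trim().length > 0) out += line + '\n';
            line = '';
        } else {
            for (const child of node.children) visit(child, node.indent ? depth + 1 : depth);
        }
    };
    visit(root, 0);
    return out + line;
}

// Task 1: build a description of the code to be generated.
function describeFunction(fnName: string, body: Generated): Generated {
    return new CompositeNode().append(
        `function ${fnName}() {`, new NewLine(),
        new CompositeNode(true).append(body, new NewLine(true)),
        '}'
    );
}

console.log(render(describeFunction('f', 'return 42;')));
console.log(render(describeFunction('g', undefined)));
```

In this sketch a body of undefined leaves no empty line behind, because the conditional line break after it finds nothing on its line.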

In the early days of Langium we built code generators by composing generator descriptions programmatically, e.g., those contained in the Langium CLI. That way, code generator implementations get dominated by tons of node.append(...) or node.children.push(...) instructions, and the desired structure of the code to be generated quickly gets obfuscated.

… Via tagged templates

With Langium v1.0 we published another tag function named expandToNode, our Solution B. This function implements the composition of generation descriptions based on JavaScript template literals. Recall the example of generateModule2 from the arithmetics language example:

function generateModule2(root: Module): string {
    return expandToString`
········"use strict";
········(() => {
········  ${generateModuleContent(root)}
········})
    `;
}

Switching this to the two-stage generation is easily done by replacing the tag function with expandToNode and changing the return type to Generated.

function generateModule3(root: Module): Generated {
    return expandToNode`
········"use strict";
········(() => {
········  ${generateModuleContent2(root)}
········})
    `;
}

As with expandToString, the indentation indicated by ········ is automatically removed from the template. Also the initial line break right after the opening back tick is omitted, as well as the line break and its subsequent white space right before the closing back tick.

The result of generateModule3(Module) must then be converted into a string, which I above referred to as generation task 2. To that end Langium provides the function named toString(unknown). If toString is called with an argument of type GeneratorNode it will render that into a string, otherwise it delegates to JavaScript’s default String constructor.

So far, so boring. Let’s now have a look at the implementation of generateModuleContent2(Module), also taken from last time:

function generateModuleContent2(module: Module): string {
    return expandToString`
        let ${lastComputableExpressionValueVarName};
        ${ module.statements.map(generateStatement).join('\n') }

        return ${lastComputableExpressionValueVarName};
    `;
}

Again, I replaced the tag function and return type as above. However, we don’t want to immediately join the generation results for the elements of statements to a string. Instead we want to create generation descriptions for each element and include those in this template’s result. To that end Langium provides the function joinToNode(). It is supposed to be used as illustrated subsequently in generateModuleContent3(Module):

function generateModuleContent3(module: Module): Generated {
    return expandToNode`
        let ${lastComputableExpressionValueVarName};
        ${ joinToNode(module.statements, generateStatement, { appendNewLineIfNotEmpty: true }) }

        return ${lastComputableExpressionValueVarName};
    `;
}

joinToNode takes a collection of elements to be visited as its first argument, a function creating the generation description for each element as its second, and an optional configuration object that lets you specify a separator or register additional callbacks like an element filter and prefix/suffix providers. If the input collection is empty, or nothing is generated for any of its elements, joinToNode returns nothing either, in practice denoted by undefined.
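The contract just described can be sketched on plain strings. The function joinToText and the interface JoinOptions below are hypothetical stand-ins written for this illustration: they mimic only the separator, filter, and appendNewLineIfNotEmpty ideas, and they return a string instead of a generator node, so they don't reflect how joinToNode is actually implemented.

```typescript
// Hypothetical, string-based stand-in for the behavior described for joinToNode.
interface JoinOptions<T> {
    separator?: string;
    filter?: (element: T) => boolean;
    appendNewLineIfNotEmpty?: boolean;
}

function joinToText<T>(
    elements: readonly T[],
    toText: (element: T) => string | undefined,
    options: JoinOptions<T> = {}
): string | undefined {
    const parts: string[] = [];
    for (const element of elements) {
        if (options.filter && !options.filter(element)) continue;
        const text = toText(element);
        if (text !== undefined) parts.push(text); // undefined means: nothing to generate
    }
    // Empty input, or nothing generated for any element: signal "nothing".
    if (parts.length === 0) return undefined;

    const separator = options.separator ?? '';
    return parts
        .map((part, i) => {
            const line = i < parts.length - 1 ? part + separator : part;
            return options.appendNewLineIfNotEmpty && line.length > 0 ? line + '\n' : line;
        })
        .join('');
}

console.log(joinToText(['a', 'b'], s => `${s} = 1;`, { appendNewLineIfNotEmpty: true }));
console.log(joinToText<string>([], s => s)); // undefined
```

Returning undefined rather than an empty string is what lets a caller distinguish "nothing was generated" from "something empty was generated".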

À propos undefined…

Why distinguish `undefined`?

tl;dr: expandToNode can configure line breaks to be omittable. It does so if a line’s last substitution contributes undefined or an object of type GeneratorNode. If the remaining segments of that line contribute white space characters only, the entire line will be omitted while rendering the desired output.

Consider the example of generateModuleContent3(Module) again, and let module.statements be empty. What do you expect to be generated?
Without any special treatment we would get the following:

let lastComputableExpressionValueVarName;


return lastComputableExpressionValueVarName;

Why is that?
The call of joinToNode(...) yields nothing. Its trailing line break, however, is appended to the generated code and yields the first empty line. The empty line we requested in our template is then appended in addition and gets us two empty lines in a row. However, my personal preference, and maybe also yours, is to omit the entire line containing the call of joinToNode(...), i.e., to ignore the line break following the substitution. To arrive at that preferred behavior, expandToNode inspects each line for placeholders/substitutions. If the line contains substitutions, the last one’s value is assessed in the following way:
If the substitution value is undefined or of type GeneratorNode, then configure the line’s terminating NewLineNode to render to a line break only if the line it terminates is non-empty, i.e., contains more than white space. Otherwise, configure NewLineNode to render to a line break unconditionally.

For our example of generateModuleContent3(Module) with the list of statements being empty, this means that we’ll get a line break at the end of line 1, as that contains at least the static string let, i.e., is non-empty. More precisely, the NewLineNode added to the generation description for line 1 will render to a line break regardless of its configuration, since the line is non-empty. The placeholder in line 2 resolves to undefined. Consequently, the subsequent NewLineNode representing the line break at the end of line 2 is tagged as ifNotEmpty, as described above. During the string rendering process later on (task 2), line 2 is evaluated to be empty, which makes the terminating NewLineNode render to nothing.

Line 3 consists of just a line break — no substitution is contained at all — and causes an unconditional NewLineNode to be added to the generation description. Line 4 demands the addition of return· to the generation description, as well as the string value of the content of lastComputableExpressionValueVarName. The terminating line break is ignored as the template is closed in the next line.

With this approach of interpreting the template, the actual generation result will be equal to the following snippet and meet our preference:

let lastComputableExpressionValueVarName;

return lastComputableExpressionValueVarName;

Note: This approach also allows enforcing unconditional line breaks for lines containing just whitespace and substitutions potentially contributing undefined.
Just add ?? '' to the content of the (last) substitution, or add a further one like ${''} to the end of the line. expandToNode will then insert an unconditional NewLineNode. By the way: The latter option does the trick also for substitutions contributing potentially empty CompositeGeneratorNodes. 😉
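The line-break rule and the ?? '' trick can be reproduced with a toy tag function. expandSketch and describeStatements below are hypothetical names made up for this illustration. The sketch renders straight to a string, accepts only string or undefined substitutions, and uses a simplified indentation removal, so it should be read as a model of the rule rather than as Langium's expandToNode.

```typescript
// Toy tag function for illustration, NOT Langium's expandToNode: it renders
// straight to a string and only accepts string or undefined substitutions.
function expandSketch(strings: TemplateStringsArray, ...values: Array<string | undefined>): string {
    const lines: Array<{ text: string; omittable: boolean }> = [];
    let text = '';
    let omittable = false; // true if the line's last substitution was undefined
    strings.forEach((chunk, i) => {
        chunk.split('\n').forEach((piece, j) => {
            if (j > 0) { // a line break in the template closes the current line
                lines.push({ text, omittable });
                text = '';
                omittable = false;
            }
            text += piece;
        });
        if (i < values.length) {
            text += values[i] ?? '';
            omittable = values[i] === undefined;
        }
    });
    lines.push({ text, omittable });

    // Drop the line break after the opening back tick and the line before the closing one.
    if (lines[0]?.text.trim() === '') lines.shift();
    if (lines[lines.length - 1]?.text.trim() === '') lines.pop();

    // Remove the common indentation of all lines that carry content.
    const indents = lines
        .filter(l => l.text.trim() !== '')
        .map(l => l.text.length - l.text.trimStart().length);
    const indent = indents.length > 0 ? Math.min(...indents) : 0;

    return lines
        .filter(l => !(l.omittable && l.text.trim() === '')) // the "ifNotEmpty" rule
        .map(l => l.text.slice(indent))
        .join('\n');
}

// Hypothetical helper: yields undefined if there is nothing to generate.
function describeStatements(statements: string[]): string | undefined {
    return statements.length > 0 ? statements.join('\n') : undefined;
}

const resultVar = 'result';

const withUndefined = expandSketch`
    let ${resultVar};
    ${describeStatements([])}

    return ${resultVar};
`;

const withEmptyString = expandSketch`
    let ${resultVar};
    ${describeStatements([]) ?? ''}

    return ${resultVar};
`;

console.log(withUndefined);   // one empty line between the two statements
console.log(withEmptyString); // two empty lines between the two statements
```

The only difference between the two templates is the ?? '' in the second one: it turns the last substitution of that line into a string, so the line break after it is kept unconditionally.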

Find the entire example at https://github.com/langium/langium-in-browser-codegen-example.

Remark: At the time of writing I discovered a few minor issues in the implementation of expandToNode concerning some corner cases.
They’re fixed in Langium v1.2.1.

Benefits & what else?

The tag function expandToNode returns an instance of CompositeGeneratorNode representing the generation description of some piece of text. Such objects are arbitrarily composable and also manipulatable. Elements may be added, removed, or changed in their ordering. Moreover, since the concrete indentation of some text snippet described by a CompositeGeneratorNode is ultimately determined at text rendering time of its transitive container (task 2), creation and composition of a parent node and of some child nodes may happen entirely independently of each other. A child node might even be included in various positions at different levels of indentation within the same generation description. Furthermore, no hard-coded line break characters are required in templates or expressions of strings to be concatenated anymore.
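The point about indentation being determined at rendering time can be illustrated with a small toy model. The types Part, NewLine, and Composite as well as the helpers composite, indented, and render are assumptions made for this sketch and are unrelated to Langium's classes. The same child object is included at two nesting levels and picks up its indentation only when the tree is rendered.

```typescript
// Toy model for illustration; names and shapes are assumptions, not Langium's classes.
type Part = string | NewLine | Composite;
interface NewLine { kind: 'newline' }
interface Composite { kind: 'composite'; indent: boolean; children: Part[] }

const NL: NewLine = { kind: 'newline' };
const composite = (...children: Part[]): Composite => ({ kind: 'composite', indent: false, children });
const indented = (...children: Part[]): Composite => ({ kind: 'composite', indent: true, children });

function render(root: Part, indentation: string): string {
    let out = '';
    let atLineStart = true;
    const visit = (part: Part, depth: number): void => {
        if (typeof part === 'string') {
            // Indentation depends on the container chain, not on the part itself.
            if (atLineStart) out += indentation.repeat(depth);
            out += part;
            atLineStart = false;
        } else if (part.kind === 'newline') {
            out += '\n';
            atLineStart = true;
        } else {
            part.children.forEach(child => visit(child, part.indent ? depth + 1 : depth));
        }
    };
    visit(root, 0);
    return out;
}

// The child is created once, without knowing where it will end up.
const logCall = composite('console.log("hi");', NL);

const program = composite(
    logCall,                 // included at the top level
    'function f() {', NL,
    indented(
        'if (true) {', NL,
        indented(logCall),   // and again, two levels deeper
        '}', NL
    ),
    '}'
);

console.log(render(program, '  '));
console.log(render(program, '\t'));
```

Switching from two spaces to tabs requires no change to the description itself, only a different argument to the rendering step.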

In addition, generator implementations may switch back and forth between the tagged-template-based implementation style and the plain method-call-based style, depending on what fits best. The boundary between these two is blurry, since CompositeGeneratorNode defines further convenience methods. Some are listed below; refer to the Langium code base for their precise definitions:

append(...Generated[])
appendNewLine()
appendNewLineIfNotEmpty()
appendIf(boolean, ...Generated[])
appendTemplate`<template content>`
appendTemplateIf(boolean)`<template content>`
indent(Generated[])
…

They facilitate the concatenation of template parts in a fluent interface manner as follows, which might be preferable in certain cases.

function generateModuleContent3(module: Module): Generated {
    return expandToNode`
        let ${lastComputableExpressionValueVarName};
    `.appendNewLine()
     .appendIf(module.statements.length !== 0,
        joinToNode(module.statements, generateStatement, { appendNewLineIfNotEmpty: true })
    ).appendTemplate`

        return ${lastComputableExpressionValueVarName};
    `;
}

Lastly, the generator-node-based approach paves the way for tracing generated portions back to their corresponding source definitions in the DSL text. I will report on that in the third part of this series, coming soon. I hope you enjoyed this read and invite you to leave feedback at Langium’s discussion board.

About the Author

Christian Schneider

Christian is a seasoned DSL enthusiast with a decade of expertise in language tooling and graphical visualizations. While striving for innovative and sustainable solutions, he still pays attention to details. Quite often you’ll find Christian near the sea, with the seagulls providing a pleasant ambience to his meetings.