
January 27, 2021 By Sebastian Slepowronski

Building autocomplete with ANTLR and CodeMirror

What’s the problem?

At Sumo Logic, we’re dealing with a large amount of data. To help our customers explore the data quickly and effectively, our product lets them write Logs, Metrics, and Tracing queries. One of the challenges we dealt with recently was improving the query building experience in our new, revamped Metrics UI.

We’ve created a basic query builder where users can build queries using a structured UI. However, more advanced users are used to the quicker, keyboard-first experience of writing free-text Metrics Query Language queries. Given that it’s practically impossible to remember all the names and feasible combinations of metric dimensions and their values, we’d like to offer our customers an autocomplete mechanism to aid the query building process.

How to suggest relevant pieces of data to the users?

In other words, we had to come up with a reliable, fast, and maintainable way of displaying autocomplete suggestions to users.

We weighed several options against the three conditions above:

Suggestions computed on the backend side

This is a tempting option because the backend has comprehensive knowledge of the query language. The biggest disadvantage, however, is relatively slow perceived performance: every change of the query string or caret position would trigger a backend request. Therefore, it doesn’t meet the requirement of a fast user experience.

Simple and naive query string parsing on the frontend side

Working directly with the query string at a low level is the most straightforward solution to implement, but potentially very hard to maintain in the long run. What if the query language needs to change? The custom parsing algorithm would need to be adjusted. The same goes for extending the language. This approach is easy to develop right now, but it comes with an additional development cost at the end, leaving tech debt behind (remember, every debt needs to be paid off sooner or later).

Lexing and parsing query string on the frontend side

In the end, the best option turned out to be having a complete definition of the query language on the frontend side, combined with the speed improvements (like caching or computing the suggestions without a network request) that are only achievable there.

Then, the question arises: how do we parse the string to get the full context and knowledge about its building blocks?

ANTLR is the answer. It’s a powerful parser generator. With its help, we’ll be able to parse the raw, incomprehensible string into an understandable data structure, the so-called Abstract Syntax Tree (AST). We’ll achieve that by writing the language grammar file (you can see a simple example here and a set of more complex ones there) and generating the runtime code with the TypeScript target for ANTLR 4.

We’ll do exactly that right now. By the end of this blog post, we’ll have a working autocomplete based on a simple grammar accepting the key=value syntax. The UI part will be handled by React and the CodeMirror library, together with its show-hint plugin responsible for controlling the autocomplete dropdown.

For the sake of clarity, we’ll focus only on the crucial pieces of code below but you can find the whole project on GitHub. I encourage you to clone the repository and play with it.

Let’s define the project and dependencies

Before digging into the details of parsing the query string, shall we start with creating the base React TypeScript application? Of course! We can do it very quickly using the create-react-app CLI tool:

> npx create-react-app autocomplete --template typescript

Then, go to the autocomplete folder and install all the required dependencies:

  • antlr4ts - the ANTLR TypeScript runtime library providing the classes required during the lexing and parsing process. For instance, CharStreams (a simple stream of characters) or CommonTokenStream (a stream of tokens coming from lexing - don’t worry, we’ll follow this process in the next section)

  • CodeMirror and its React wrapper for rendering the code editor input, plus the show-hint plugin to display the suggestions

> yarn add antlr4ts codemirror react-codemirror2

Then, proceed with dev dependencies:

  • antlr4ts-cli - the tool that generates code out of our language grammar file

  • CodeMirror’s types, because we don’t want to give up the perks of TypeScript typings

> yarn add -D antlr4ts-cli @types/codemirror

We’ve got the base application at last, so we can start building upon it.

How to understand the custom query language?

The backend can return a list of key or value suggestions for a given term, but first we need a good understanding of the keys and values present in the input string typed by the user. Moreover, it’s crucial to know whether the caret is currently at a key or at a value. We’ll combine this information to construct the backend request.

To retrieve such data, the query string needs to be parsed into an Abstract Syntax Tree (AST).

This process contains two steps:

  • Lexing - generating a set of tokens from the input query string

  • Parsing - using the tokens from the previous step to generate the final AST

Lexing and parsing the ‘customKey=customValue’ query string

The AST is a data structure representing the input string in such a way that tokens and parser rules are nodes of the tree. If you traverse the tree leaves in order, you should be able to reassemble the query string.

Technically speaking, we need to define our .g4 language grammar file, containing lexer rules that define matchers for tokens in the string, and parser rules that let us generate the tree by specifying the relationships between rules and tokens.

ANTLR will help us (thank you!) to generate the runtime parser out of the grammar file that we’ll create in a moment.

You could ask: what is a grammar, after all?

I’m eager to answer this question.

The grammar is a set of lexer and parser rules describing our whole language. In other words, it specifies the language and sets the boundaries to clearly define what’s allowed and what’s not.

The lexer is a set of basic, low-level rules, each with a matching expression. Applied to the input string, the lexer returns the tokens matched by those rules. As a result, we receive the set of tokens used later in the parsing process.

We can define two lexer rules:

ALPHANUMERIC: [a-zA-Z] [0-9a-zA-Z]*;
EQ: '=';

The first one matches an alphanumeric word starting with a letter, while the second one matches only the equals sign itself.

These rules define two possible tokens (ALPHANUMERIC and EQ) that can be retrieved from the string.
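
For example, the input customKey=customValue yields three tokens: ALPHANUMERIC ('customKey'), EQ ('='), and ALPHANUMERIC ('customValue').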

The parser rules define relationships between parser and lexer rules. They allow generating the final AST out of the tokens produced by the preceding lexing step.

Let’s define the set of rules making up our language:

expression: (key | keyValueExpression) EOF;
keyValueExpression: key EQ value;
key: ALPHANUMERIC;
value: ALPHANUMERIC | /* empty */;

In essence, our grammar accepts either the key (an alphanumeric word) on its own or the combination of a key and a value (possibly empty) connected with the equals sign. EOF, being an explicit marker for the end of the expression, isn’t strictly required, but there are still some unresolved edge cases, so... better safe than sorry.
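
Putting the lexer and parser rules together, the complete KeyValue.g4 file might look like this (a sketch - the file in the repository may differ in details such as comments or whitespace handling):

grammar KeyValue;

// Parser rules
expression: (key | keyValueExpression) EOF;
keyValueExpression: key EQ value;
key: ALPHANUMERIC;
value: ALPHANUMERIC | /* empty */;

// Lexer rules
ALPHANUMERIC: [a-zA-Z] [0-9a-zA-Z]*;
EQ: '=';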

Based on the lexer and parser rules, the Abstract Syntax Tree will be created at runtime, and the rules above will become nodes of this tree.

Congratulations! We’ve just created a simple grammar accepting either a key=value expression or a key alone. All of the following inputs are valid:

customKey
customKey=
customKey=customValue

Note: There is a very useful ANTLR v4 plugin for IntelliJ-based IDEs allowing you to easily explore AST and improve your grammar. Simply create the .g4 file and play with that to see all the possible outcomes.

ANTLR, please generate TypeScript files for us

As soon as we’ve got our .g4 grammar file, we need to generate the runtime code that we’ll be able to use in our application. Even though there is an official ANTLR JavaScript target, we’ll use the TypeScript one so we don’t lose the benefits of static typing.

The most convenient way is adding the script to package.json:

"scripts": { … "grammar": "antlr4ts -visitor ./src/grammar/KeyValue.g4" }

and running it afterwards:

> yarn grammar

As a result, a bunch of TS files are created in the same folder as the .g4 one. They contain the foundation of our grammar: KeyValueLexer.ts and KeyValueParser.ts.

Additionally, we should pay special attention to the KeyValueVisitor.ts file, as we’ll implement this interface in our custom visitor. It defines a visit function for every node of the AST. This file was generated because, as you may have noticed, we used the -visitor option. Alternatively, a listener would have been generated instead. The difference between the two is quite subtle: a listener always walks through all nodes of the tree, while with a visitor we decide whether child nodes should be visited or not. More control is always a good thing, so let’s proceed with the latter approach.
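
For reference, the generated KeyValueVisitor.ts interface looks roughly like this (simplified - the actual generated file contains additional documentation comments):

// Simplified sketch of the generated KeyValueVisitor.ts
import { ParseTreeVisitor } from 'antlr4ts/tree/ParseTreeVisitor';

import { ExpressionContext, KeyValueExpressionContext, KeyContext, ValueContext } from './KeyValueParser';

export interface KeyValueVisitor<Result> extends ParseTreeVisitor<Result> {
    // One optional visit function per parser rule
    visitExpression?: (ctx: ExpressionContext) => Result;
    visitKeyValueExpression?: (ctx: KeyValueExpressionContext) => Result;
    visitKey?: (ctx: KeyContext) => Result;
    visitValue?: (ctx: ValueContext) => Result;
}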

It’s high time to write some real visitor code

We’ve got the project and all the files ready so we can finally start writing our code.

Let’s take a look at the data model produced by walking the tree. It directly mirrors the options from our grammar: the result is either a key, or a value within a given key. Additionally, the result contains a range, which is the start and end index of the detected token (key or value respectively).

// Token range - start and end included
export interface Range {
    start: number;
    end: number;
}

// Search key
interface KeyResult {
    type: 'KeyResult';
    key: string;
    range: Range;
}

// Search value within given key
interface ValueResult {
    type: 'ValueResult';
    key: string;
    value: string;
    range: Range;
}

export type KeyValueResult = KeyResult | ValueResult | undefined;

The visitor KeyValueResultVisitor, extending AbstractParseTreeVisitor from antlr4ts, is the heart of our solution.

The final result depends on two factors: the query string and the current caret position:

Query string: abc, caret position (index): 0 - 3
{ type: 'KeyResult', key: 'abc', range: { start: 0, end: 2 } }

Query string: abc=, caret position (index): 0 - 3
{ type: 'KeyResult', key: 'abc', range: { start: 0, end: 2 } }

Query string: abc=, caret position (index): 4
{ type: 'ValueResult', key: 'abc', value: '', range: { start: 4, end: 4 } }

Query string: abc=xyz, caret position (index): 0 - 3
{ type: 'KeyResult', key: 'abc', range: { start: 0, end: 2 } }

Query string: abc=xyz, caret position (index): 4 - 7
{ type: 'ValueResult', key: 'abc', value: 'xyz', range: { start: 4, end: 6 } }

We’ll implement several visit functions to define the logic performed when the corresponding nodes of the tree are visited.
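
All of these visit functions live inside the KeyValueResultVisitor class. Here is a rough sketch of its skeleton (the import paths and the model file name are assumptions about the project layout):

import { AbstractParseTreeVisitor } from 'antlr4ts/tree/AbstractParseTreeVisitor';

import { KeyValueVisitor } from '../grammar/KeyValueVisitor';
import { ExpressionContext, KeyValueExpressionContext, KeyContext, ValueContext } from '../grammar/KeyValueParser';
import { KeyValueResult, Range } from './model';

export class KeyValueResultVisitor
    extends AbstractParseTreeVisitor<KeyValueResult>
    implements KeyValueVisitor<KeyValueResult> {

    constructor(private readonly caretPosition: number) {
        super();
    }

    // Returned when no visit function produces a more specific result
    protected defaultResult(): KeyValueResult {
        return undefined;
    }

    // ...the visit functions described below go here
}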

We should start by visiting the root node of the tree (expression). If the original query string is empty, we want to search all the keys. Otherwise, proceed with visiting the child nodes.

visitExpression(node: ExpressionContext): KeyValueResult {
    if (node.text === '') {
        return {
            type: 'KeyResult',
            key: '',
            range: {
                start: 0,
                end: 0,
            }
        };
    }
    return this.visitChildren(node);
}

Next, let’s visit the ‘key=value’ node, where we want to distinguish whether the caret is positioned at the key or at the value.

visitKeyValueExpression(node: KeyValueExpressionContext): KeyValueResult {
    const key = node.children?.find((child) => child instanceof KeyContext);
    const value = node.children?.find((child) => child instanceof ValueContext);

    // Both key and value are defined, caret is positioned at the value -> search value within the given key
    if (key !== undefined && value !== undefined && this.isWithinCaretPosition(value)) {
        return {
            type: 'ValueResult',
            key: key.text,
            value: value.text,
            range: KeyValueResultVisitor.getRange(value),
        };
    }
    // Caret is positioned at the key -> search for keys filtered by the given key text
    else if (key !== undefined && this.isWithinCaretPosition(key)) {
        return {
            type: 'KeyResult',
            key: key.text,
            range: KeyValueResultVisitor.getRange(key),
        };
    }
    return this.defaultResult();
}

Finally, we can handle the case of visiting the key alone (when the ‘=value’ part hasn’t been typed yet). In that case, we should search for keys filtered by the given key text.

visitKey(node: KeyContext): KeyValueResult {
    return {
        type: 'KeyResult',
        key: node.text,
        range: KeyValueResultVisitor.getRange(node),
    };
}
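
The snippets above use two small helpers that aren’t part of the generated code: isWithinCaretPosition() checks whether the caret falls inside (or directly after) the text matched by a node, and getRange() extracts the node’s start and end character indexes. A minimal sketch of both, assuming the nodes are antlr4ts ParserRuleContext instances (the actual implementations in the repository may differ):

// Is the caret inside (or directly after) the text matched by this node?
private isWithinCaretPosition(node: ParserRuleContext): boolean {
    const { start, end } = KeyValueResultVisitor.getRange(node);
    return this.caretPosition >= start && this.caretPosition <= end + 1;
}

// Start and end character indexes (both included) of the text matched by this node
private static getRange(node: ParserRuleContext): Range {
    const start = node.start.startIndex;
    // An empty match (e.g. the value in 'abc=') has no meaningful stop token, so fall back to the start index
    const end = node.stop !== undefined && node.stop.stopIndex >= start ? node.stop.stopIndex : start;
    return { start, end };
}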

Create a util to see the visitor in action

We’ve got the visitor, so nothing stops us from using it in a util function.

// Imports used below - adjust the paths to your own project layout
import { CharStreams, CommonTokenStream } from 'antlr4ts';

import { KeyValueLexer } from '../grammar/KeyValueLexer';
import { KeyValueParser } from '../grammar/KeyValueParser';
import { KeyValueResultVisitor } from './KeyValueResultVisitor';
import { KeyValueResult } from './model';

export const parseQuery = (query: string, caretPosition: number): KeyValueResult => {
    // Create input stream from the given query string
    const inputStream = CharStreams.fromString(prepareQuery(query));
    // Create lexer
    const lexer = new KeyValueLexer(inputStream);
    const tokenStream = new CommonTokenStream(lexer);
    // Create parser based on the tokens from lexer
    const parser = new KeyValueParser(tokenStream);

    // Create Abstract Syntax Tree based on the root 'expression' from the parser
    const tree = parser.expression();

    // Visit the tree to gather the result
    const visitor = new KeyValueResultVisitor(caretPosition);
    return visitor.visit(tree);
};

const prepareQuery = (query: string): string => {
    // Remove whitespaces at the end of query string
    return query.trimEnd();
};
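
A quick sanity check of the util, using the examples from the previous section:

console.log(parseQuery('abc', 1));
// -> { type: 'KeyResult', key: 'abc', range: { start: 0, end: 2 } }

console.log(parseQuery('abc=xyz', 5));
// -> { type: 'ValueResult', key: 'abc', value: 'xyz', range: { start: 4, end: 6 } }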

The Big Moment - use the util in our component

Without further delay, let’s import the required dependencies. In particular, pay attention to the CodeMirror dependencies together with its show-hint plugin:

// react-codemirror2
import {UnControlled as CodeMirrorEditor} from 'react-codemirror2'
// codemirror
import CodeMirror, {Editor} from 'codemirror';
import 'codemirror/lib/codemirror.css';
import 'codemirror/theme/material-ocean.css';
// codemirror - show-hint
import 'codemirror/addon/hint/show-hint';
import 'codemirror/addon/hint/show-hint.css';

Define types that will be used in the state of our function component:

// CodeMirror's base and show-hint type
type CodeMirrorEditorType = CodeMirror.Editor & Editor;

// Query info encapsulating the query string and current caret position
interface QueryInfo {
    query: string;
    caretPosition: number;
}

// Store for the fetched suggestions together with the result of parsing
interface SuggestionsInfo {
    result: KeyValueResult;
    suggestions: string[];
}

Finally, let’s define the AutoComplete function component:

export const AutoComplete: FunctionComponent = () => {
...

The following callback is performed whenever the value or the caret position in CodeMirror changes. It updates the query info in the state:

const onChange = useCallback((editor: CodeMirror.Editor) => {
    setQueryInfo({
        query: editor.getValue(),
        caretPosition: editor.getCursor().ch,
    });
}, []);

As soon as the query info is changed, parse the query using the util function that we’ve written previously.

Use the result of visiting the AST and pass it to fetchSuggestions(). This is the function that sends a request to the backend and resolves the Promise with the fetched suggestions. In our case, the function should distinguish between the "keys" and "values within key" scenarios to send the correct request.

Additionally, visiting the AST gives us the range of the token the caret is currently at. In the case of "keys", it will be the start and end index of the key token. Otherwise, for "values within key", it will be the range of the value. We’ll use this information later to tell CodeMirror’s show-hint plugin where to put the picked suggestion.

const prevQueryInfo = usePrevious(queryInfo);
useEffect(() => {
    ...
    const result = parseQuery(queryInfo.query, queryInfo.caretPosition);

    fetchSuggestions(result).then((fetchedSuggestions) => setSuggestionsInfo({
        result,
        suggestions: fetchedSuggestions,
    }));
}, [codeMirrorEditor, prevQueryInfo, queryInfo]);
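
The effect above relies on two helpers that aren’t shown here: usePrevious(), a common hook that returns the value from the previous render (used to skip redundant work), and fetchSuggestions(), which asks the backend either for keys or for values within the given key. A minimal sketch of both - the endpoint URL and its query parameters are made up for illustration only, and the real project wires this to the actual suggestions API:

import { useEffect, useRef } from 'react';

// Returns the value from the previous render
const usePrevious = <T,>(value: T): T | undefined => {
    const ref = useRef<T>();
    useEffect(() => {
        ref.current = value;
    });
    return ref.current;
};

// Asks the backend either for keys or for values within the given key.
// The '/api/suggestions' endpoint and its parameters are assumptions for illustration only.
const fetchSuggestions = async (result: KeyValueResult): Promise<string[]> => {
    if (result === undefined) {
        return [];
    }
    const params = result.type === 'KeyResult'
        ? `type=key&prefix=${encodeURIComponent(result.key)}`
        : `type=value&key=${encodeURIComponent(result.key)}&prefix=${encodeURIComponent(result.value)}`;
    const response = await fetch(`/api/suggestions?${params}`);
    return response.json();
};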

Finally, take the fetched suggestions and display them using CodeMirror’s show-hint:

useEffect(() => {
    ...
    const isKeyResult = suggestionsInfo.result?.type === 'KeyResult';
    const options = {
        // Don't complete automatically in case of only one suggestion
        completeSingle: false,
        hint: () => ({
            from: { line: 0, ch: suggestionsInfo.result?.range.start ?? 0 },
            to: {
                line: 0,
                ch: (suggestionsInfo.result?.range.end ?? 0) +
                      (isKeyResult ? 1 : 0) + // for key result, extend index by 1 to cover the '=' in query
                      1, // end index is excluded so let's add 1
            },
            list: suggestionsInfo.suggestions.map((text) => ({
                // Append '=' to the key and ' ' to the value
                text: isKeyResult ? `${text}=` : `${text} `,
            })),
        }),
    };
    codeMirrorEditor.showHint(options);
}, [codeMirrorEditor, suggestionsInfo]);

Our function component simply renders the uncontrolled CodeMirrorEditor coming from react-codemirror2.

return (
    <CodeMirrorEditor
        options={{
            theme: 'material-ocean',
            lineNumbers: true
        }}
        editorDidMount={setCodeMirrorEditor}
        onCursorActivity={onChange}
    />
);

This component is relatively maintainable, but as new functionality is added and the codebase grows, particular steps of the parsing and fetching process should be extracted into independent hooks.

Final words

We’ve written our own simple ANTLR grammar, generated TypeScript classes out of it, and displayed the suggestions using CodeMirror’s show-hint plugin. Such an implementation gives clear and tangible benefits to the users. They can use all the power of Metrics Query Language without needing to remember all the keywords and variable names, which greatly speeds up writing queries. It’s a perfect combination of two worlds: the simplicity of the basic UI query builder with the speed of writing queries with the keyboard in the advanced mode.

That sounds great but what’s next?

A language is a living, dynamic thing, so it will surely change over time. In that case, you’ll update the .g4 grammar file, re-generate the classes using antlr4ts, and probably add new visit functions depending on your needs. You’ll likely also extend your language by adding new syntax on top of the existing foundation. I encourage you to explore various existing ANTLR grammars for inspiration, as the possibilities are nearly limitless. Good luck.
