Our tech team at MEV is constantly exploring new ways to use existing platforms to innovate and solve problems. Recently, we explored how ANTLR can be used to develop a highly customizable and efficient search engine and to visualize room layouts.
While these use cases are specific to the real estate domain, the solutions can be applied broadly across industries. Our team combines scientific and technical knowledge, practical expertise, and craftsmanship to develop optimal long-term solutions rather than just quick fixes.
There are various options for searching published real estate listings, such as Zillow, Redfin, and the MLS. These systems have their own limitations, and they are not tied to a specific database. With ANTLR, we can develop a highly customized search system that efficiently finds listings in an established database using user criteria and custom parameters.
The Goal: A Custom Search System
We wanted to build a highly personalized and efficient search system that allows users to navigate a broad spectrum of options using over 300 parameters, including:
- Numerical values or ranges, such as price or area in square meters
- Dates, such as when the property was last sold
- Boolean (logical) data type (e.g., "pool = yes" or "pool = no")
- Location
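As an illustration, these kinds of parameters could be modeled as typed filter values. The following Python sketch is purely illustrative; names such as `NumericRange` and `Filter` are our invention here, not the production data model:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union

@dataclass
class NumericRange:
    """An open or closed numeric range, e.g. a price bracket."""
    low: Optional[float] = None
    high: Optional[float] = None

    def contains(self, value: float) -> bool:
        # A bound of None means that side of the range is open.
        if self.low is not None and value < self.low:
            return False
        if self.high is not None and value > self.high:
            return False
        return True

@dataclass
class Filter:
    """One search criterion: a field name plus a typed value."""
    field: str  # e.g. "price", "pool", "last_sold"
    value: Union[NumericRange, date, bool, str]

price = Filter("price", NumericRange(low=100_000, high=250_000))
print(price.value.contains(180_000))  # → True
```

A real system would carry many more value kinds (geo shapes for location, enumerations for suggested options), but the principle of "field plus typed value" stays the same.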
In our case, users may not even remember all these parameters, so we also want to suggest options for them. For example, if they want to find real estate in a specific location, we provide suggestions such as mountainous terrain, proximity to the ocean, distance from a volcano, and more, all offered under the "Location" parameter.
The client should also be able to migrate search queries created in another system, with a different format, into ours. We implemented this capability and used it to identify and correct logical errors in broken queries. For example, a client in another system had configured a search for housing within walking distance of the sea near New York; when we migrated their query to our system, we caught a logical error in it.
We had to ensure that queries were easy to form on the user's side (frontend). Additionally, we needed the ability to analyze these queries on the server side (backend). Finally, we had to translate these queries into a language understood by the Elasticsearch search server, or by any other data store that might be used in the future.
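To make the translation step concrete, here is a hedged Python sketch of how simple "field operator value" predicates joined by AND could be rendered as an Elasticsearch bool query body. The helper names and the flat tuple input are illustrative assumptions; our production translator walks a parse tree instead:

```python
def predicate_to_es(field: str, op: str, value):
    """Translate one predicate into an Elasticsearch query clause."""
    if op == "=":
        return {"term": {field: value}}
    if op in (">=", "<=", ">", "<"):
        # Map comparison operators onto Elasticsearch range keywords.
        key = {">=": "gte", "<=": "lte", ">": "gt", "<": "lt"}[op]
        return {"range": {field: {key: value}}}
    raise ValueError(f"unsupported operator: {op}")

def conjunction_to_es(predicates):
    """Join predicates with AND via a bool/must query."""
    return {"query": {"bool": {"must": [predicate_to_es(*p) for p in predicates]}}}

query = conjunction_to_es([("price", "<=", 250_000), ("pool", "=", True)])
print(query)
```

The resulting dictionary can be serialized to JSON and sent to Elasticsearch's `_search` endpoint; `bool`/`must`, `term`, and `range` are standard Elasticsearch Query DSL constructs.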
The Solution: A Custom Query Language
We already had a metadata subsystem in which the fields available for filtering, along with their types, names, and so on, were described as a set of server-side structures. Based on this and our goal, we concluded that the language's grammar would resemble a predicate; more precisely, a set of predicates combined with each other using the conjunction (logical AND) operation.
To implement the functionality of analyzing and translating queries on the server side, we chose ANTLR4. ANTLR (ANother Tool for Language Recognition) is a powerful parser generator used for reading, processing, executing, or translating structured text or binary files. It is widely employed in creating languages, tools, and frameworks.
From a grammar, ANTLR generates a parser capable of building and traversing parse trees. Therefore, having defined the grammar of our query language, we needed to describe it in ANTLR.
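A minimal grammar in this predicate-conjunction spirit might look like the following sketch. The rule and token names here are illustrative, not our production grammar:

```antlr
grammar SearchQuery;

// A query is one or more predicates joined by AND.
query     : predicate (AND predicate)* EOF ;
predicate : FIELD OP value ;
value     : NUMBER | DATE | BOOLEAN | STRING ;

AND     : 'AND' ;
OP      : '>=' | '<=' | '!=' | '=' | '>' | '<' ;
BOOLEAN : 'yes' | 'no' ;
DATE    : [0-9][0-9][0-9][0-9] '-' [0-9][0-9] '-' [0-9][0-9] ;
NUMBER  : [0-9]+ ('.' [0-9]+)? ;
FIELD   : [a-zA-Z_]+ ;
STRING  : '"' ~'"'* '"' ;
WS      : [ \t\r\n]+ -> skip ;
```

From a grammar file like this, the `antlr4` tool generates the lexer and parser classes for the target language, along with listener and visitor scaffolding for walking the resulting parse tree.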
Components of ANTLR Text Processing
The lexer and the parser are the two main components ANTLR uses for text processing. They perform different tasks and rely on different types of rules to analyze the input text.
The lexer is responsible for lexical analysis of the text, which means breaking the input down into lexemes, the smallest distinguishable units of text. The lexer transforms the input into a sequence of tokens, each of which has properties such as a token type and a value. The lexer is also responsible for discarding whitespace, comments, and other characters that should not be considered in further analysis.
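The lexing stage can be illustrated with a tiny hand-written tokenizer (in ANTLR this code is generated from the lexer rules; the token names and patterns below are assumptions for the sketch):

```python
import re

# Ordered token specification: earlier patterns win on ties,
# so the AND keyword is tried before the generic FIELD rule.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(\.\d+)?"),
    ("OP",     r">=|<=|!=|[=<>]"),
    ("AND",    r"AND\b"),
    ("FIELD",  r"[A-Za-z_]+"),
    ("WS",     r"\s+"),
]

def tokenize(text):
    """Split the input into (token_type, value) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "WS":  # whitespace is discarded, as described above
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
    return tokens

print(tokenize("price >= 100000 AND pool = yes"))
```

Each pair carries exactly the two properties mentioned above: the token type (`FIELD`, `OP`, `NUMBER`, ...) and its text value.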
The parser, on the other hand, is responsible for syntactic analysis of the tokens: it checks whether the token sequence conforms to the syntactic rules described in the grammar. The parser transforms the sequence of tokens into a syntax tree representing the structure of the input text.
It is important to understand that the lexer differs from the parser in that it works with individual tokens rather than a syntax tree. The lexer has access only to the current token, while the parser can analyze the whole token stream and perform more complex logic.
So, if you think of parsing an input file as extracting information from text, the lexer is responsible for finding tokens, and the parser is responsible for applying the logic. The lexer produces a sequence of tokens that is passed to the parser, which uses the grammar rules to check the correctness of the syntax. If the syntax conforms to the grammar, the parser generates a syntax tree, which in our case represents the structure of the query.
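The parsing stage can be sketched the same way. The recursive-descent parser below is hand-written for illustration (in production this class is generated by ANTLR from the grammar); it checks a token stream against a "predicates joined by AND" grammar and builds a small tree:

```python
class Parser:
    """Illustrative parser for: query : predicate (AND predicate)* ."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of (token_type, value) pairs
        self.pos = 0

    def take(self, *kinds):
        # Consume the current token if its type is one of `kinds`.
        if self.pos >= len(self.tokens) or self.tokens[self.pos][0] not in kinds:
            raise SyntaxError(f"expected one of {kinds} at token {self.pos}")
        _, value = self.tokens[self.pos]
        self.pos += 1
        return value

    def predicate(self):
        # predicate : FIELD OP value
        field = self.take("FIELD")
        op = self.take("OP")
        value = self.take("NUMBER", "FIELD")  # a number, or a word like "yes"
        return ("predicate", field, op, value)

    def query(self):
        # query : predicate (AND predicate)*
        predicates = [self.predicate()]
        while self.pos < len(self.tokens) and self.tokens[self.pos][0] == "AND":
            self.pos += 1  # consume AND
            predicates.append(self.predicate())
        return ("and", predicates)

tokens = [("FIELD", "price"), ("OP", "<="), ("NUMBER", "250000"),
          ("AND", "AND"), ("FIELD", "pool"), ("OP", "="), ("FIELD", "yes")]
print(Parser(tokens).query())
```

An input that violates the grammar, such as a predicate with a missing operator, raises a `SyntaxError` instead of producing a tree, which mirrors how the generated parser reports syntax errors.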
To make the development of grammar in ANTLR4 more convenient and productive, we utilized the "ANTLR v4" plugin for JetBrains IDEs, such as IntelliJ IDEA. This plugin assists developers in easily creating, editing, and debugging ANTLR4 language grammars.
Challenges Addressed
Leveraging ANTLR4, we addressed several challenges, including:
- Dealing with a complex entity structure: the query had to be built in the correct order and hierarchy. We pre-built a tree of variables and expressions, then constructed the query from this tree so that NestedQueryBuilder was used correctly.
- Handling queries over String data types, which required auxiliary functionality to pre-process queries before sending them to Elasticsearch, especially around special characters.
- Developing functionality for dynamically setting the boost parameter to elevate relevant results higher in the overall result set. Another interesting aspect was building QueryStringQueryBuilder.
- Dynamically tuning word matches, setting a minimum number of matches, and configuring synonyms for search, as well as selecting and configuring the analyzer.
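As one concrete example of the special-character handling mentioned above, here is a simplified sketch of escaping characters that are reserved in Elasticsearch's `query_string` syntax. This is deliberately incomplete: for instance, Elasticsearch treats `&&` and `||` as two-character operators, and `<` and `>` cannot be escaped at all, so production handling differs:

```python
# Characters reserved in Elasticsearch query_string syntax (simplified set).
ES_RESERVED = set('\\+-=&|><!(){}[]^"~*?:/')

def escape_query_string(text: str) -> str:
    """Prefix each reserved character with a backslash before querying."""
    return "".join("\\" + ch if ch in ES_RESERVED else ch for ch in text)

print(escape_query_string("2-bedroom (pool)"))  # → 2\-bedroom \(pool\)
```

Without this step, user-entered text such as `2-bedroom` would be interpreted by `query_string` as an exclusion rather than as a literal hyphenated word.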
In our case, developing an individual query language based on ANTLR4 allowed us to create a powerful tool for syntactic analysis and processing of input queries. Currently, we can easily modify the language grammar to meet our needs, update the codebase of the ANTLR4-based functionality, and expand or modify the translation functionality as needed.
Next, we will show another example of using ANTLR4 in real estate technology.