How to Search for Complex Synonyms (Phrases) in Elasticsearch

Estafet consultants occasionally produce short, practical tech notes designed to help the broader software development community and colleagues. 


Introduction

Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. It indexes, analyses, and searches unstructured data. Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organised in a pre-defined manner. Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well. 

This article discusses text search, and more specifically searching for expressions with synonyms.

In particular I will demonstrate how synonym searches can be done in the context of (logically indivisible) keywords using analysis and transformations available only for the text data type. This allows flexibility and search efficiency far beyond the level of a partial match or unwanted additional word combinations.

I will first explain what Elasticsearch does well. I will then define what I wanted to achieve and why the implemented Elasticsearch functionality cannot help. Finally, I will explain exactly how it can be done.

Each document is a collection of fields, which each have their own data type. This type indicates the kind of data the field contains, such as strings or boolean values, and its intended use. For example, you can index strings to both text and keyword fields. However, text-field values are analysed for full-text search while keyword strings are left as-is for filtering and sorting.

Textual analysis goes through several stages:

  • character filters – remove, replace, or add characters. For example, a specialised filter can strip HTML elements like <b> from the stream, or convert non-ASCII characters to their ASCII counterparts (if any). An analyzer can have zero or more character filters.
  • tokenization – breaks a text stream down into smaller chunks, called tokens. In most cases, these tokens are individual words. For example, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace: “Enjoy your freedom!” becomes [Enjoy, your, freedom!]. An analyzer always has exactly one tokenizer.
  • token filters – a token filter receives the token stream and may add, remove, or change tokens. The variety of built-in filters is large, which gives the user a lot of flexibility. I will mention only a few:
  1. The lowercase token filter converts all tokens to lowercase.
  2. The stop token filter removes common words (stop words).
  3. The synonym token filter allows the easy handling of synonyms during the analysis process. For “PC” we could define [computer, laptop] as synonyms.
  4. The stemmer token filter performs stemming, the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.

An analyzer may have zero or more token filters, which are applied in order. For example, when both a stemmer and a synonym filter are present, the stemmer can be applied to the synonym entries as well.
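The stages above can be imitated in a few lines of plain Python. This is only a toy sketch of the idea, not how Elasticsearch implements analysis: the stop-word list, the synonym table, and all function names are assumptions made up for the illustration.

```python
import re

# Illustrative data only; real analyzers use configurable lists.
STOP_WORDS = {"a", "an", "the", "is", "my", "your"}
SYNONYMS = {"pc": ["computer", "laptop"]}


def strip_html(text):
    """Character filter: remove HTML tags such as <b>."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenize(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Token filter: emit each token followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text):
    """Zero or more character filters, one tokenizer, then token filters in order."""
    for char_filter in (strip_html,):
        text = char_filter(text)
    tokens = whitespace_tokenize(text)
    for token_filter in (lowercase, remove_stop_words, expand_synonyms):
        tokens = token_filter(tokens)
    return tokens


print(analyze("My <b>PC</b> is new"))
# ['pc', 'computer', 'laptop', 'new']
```

Because the token filters run in order, moving expand_synonyms before lowercase would stop “PC” from matching the lowercase key in the synonym table, which is why the order of the filter chain matters.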

Keywords

Keywords are mainly used for data that must remain unchanged, such as IDs, email addresses, zip codes, etc.

As I said above, Elasticsearch performs text analysis when indexing or searching text fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html), while keyword fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html) are left as-is. The documentation states this explicitly: “Avoid using keyword fields for full-text search. Use the text field type instead.”

Which behaviour of Elasticsearch didn’t meet my needs, and why I had to make my own implementation of synonyms

Elasticsearch finds documents that are close to the given expression and scores them. The more closely the content matches the criterion, the higher the relevance score of the resulting document.

When searching for an expression in a document title or summary, Elasticsearch returns partial matches. This is a problem because the expression is usually a term and not just a collection of words. To Elasticsearch, SCLC (“small-cell lung cancer”) and NSCLC (“non-small cell lung cancer”) are almost the same, but for specialists searching for SCLC, getting back documents related to NSCLC is undesirable because of the semantic difference.

I participated in the development of just such a specialised application some time ago. Given a search expression (SE), it returns the documents that best meet the set criteria. In its simplest form, an SE contains one or more objects of study (drugs, pharmaceutical substances, active substances, etc.) and some additional words (effects or interactions). When searching for “DR1 safety”, we would like to receive the document “DR1 – Incidence of Oral Adverse Events”, as it probably contains relevant information. To make this possible, we defined such logically related phrases (domain-specific synonyms, aggregated from different sources), and the search algorithm analyses the expression, detects these phrases, and applies appropriate substitutes. Here and everywhere below, “synonym” means a term: a word or phrase with a limiting and definite meaning in a specific research domain.

Suppose the user enters the search expression “Use of DR1 for DVT or PE prophylaxis in patients with cancer”. After removing the drug (DR1) and the stop words (“of”, “for”, “or”, “in”, “with”), we get the additional words (AW) “use dvt pe prophylaxis patients cancer”. We split the AW into n-grams of words (see Splitting an expression into n-grams) and look up each n-gram in our known synonymous expressions (see Searching for synonyms in the index). In this way, we identify each n-gram that has synonyms, along with the list of its equivalent terms. Note that if an n-gram with synonyms is part of a longer n-gram with synonyms, the shorter one is ignored. For example, if both “risk of cancer” and “cancer” have synonyms, we keep only the longer one (“risk of cancer”). This is reasonable, as a longer expression means a more accurate context.
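The “longer n-gram wins” rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input is the list of n-grams already found to have synonyms, and an n-gram is dropped whenever its words occur contiguously inside any longer matched n-gram.

```python
def contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def keep_longest(ngrams):
    """Drop every n-gram that is part of a longer n-gram in the same list."""
    split = [g.split() for g in ngrams]
    kept = []
    for i, words in enumerate(split):
        covered = any(
            len(other) > len(words) and contains(other, words)
            for j, other in enumerate(split)
            if j != i
        )
        if not covered:
            kept.append(ngrams[i])
    return kept


print(keep_longest(["risk of cancer", "cancer", "dvt"]))
# ['risk of cancer', 'dvt']
```

The sketch ignores positions: if the same word also appeared elsewhere in the expression, outside the longer n-gram, a production version would probably need to track word offsets instead of comparing text only.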

Since synonymous phrases are logically indivisible, each one is treated as a single word and searched with a “match_phrase” query, i.e. a full match of the entire phrase. Note that the phrase here is a single term, not the entire AW expression.
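One possible way to express this in the query DSL is a bool query with one match_phrase clause per equivalent term. The sketch below only builds the request body as a Python dictionary. The field name “title”, the helper name, and the choice to combine the clauses with should are assumptions for illustration; the original application may combine them differently.

```python
def synonym_phrase_query(field, term, synonyms):
    """Build a bool query that matches the term or any of its synonyms as a whole phrase."""
    # Keep the original term first and avoid listing it twice.
    phrases = [term] + [s for s in synonyms if s != term]
    return {
        "bool": {
            "should": [{"match_phrase": {field: p}} for p in phrases],
            # At least one of the phrases must match in full.
            "minimum_should_match": 1,
        }
    }


query = synonym_phrase_query(
    "title",  # hypothetical field name
    "cancer risks",
    ["risks of cancers", "malignancy risks"],
)
print(query)
```

Each clause requires the whole phrase to appear, so a document that mentions only “risks” or only “cancer” does not satisfy it.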

In order for this whole process to work, it is essential to have a specialised index to look up synonyms for a given ngram.

Index implementation

Synonyms are expressions that must match exactly, which naturally suggests storing them in an Elasticsearch “keyword” field. At the same time, I wanted matching to tolerate (1) inflection, which calls for stemming (“risk of malignancies”, “risks of malignancy”), and (2) paraphrasing (“bone frontal”, “frontal bones”). Neither changes the meaning of the expression, but both are only possible with full-text search.

To solve the first problem, I quickly concluded that I should use a transformation that reduces each word to its base form, i.e. an analyzer with a stemmer. However, this produces a list of words (tokens) that is easy to store in Elasticsearch but difficult to compare with another list for absolute identity. That is, if a document field holds the list [“cancer”, “risk”] and you search for [“risk”, “toxic”], Elasticsearch will return the document as a match, which is absolutely undesirable because the match is only partial.

This is easily fixed by joining the words from the list into a phrase using some separator. I chose a space (“ ”). So [“cancer”, “risk”] becomes “cancer risk” and will not match “risk toxic”. Unfortunately, it will not match the paraphrased [“risk”, “cancer”], i.e. “risk cancer”, either. It occurred to me to add a further transformation that unifies lists containing the same words in a different order: sorting. After this step, both [“cancer”, “risk”] and [“risk”, “cancer”] result in “cancer risk”.

The described transformations bring an expression (text) to a basic form that is convenient for comparison (sameness in the keyword sense), even though the words may appear in a different form (number, gender, tense) or in a different order.
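The sort-and-join step can be sketched in a couple of lines. The function name is hypothetical, and the input is assumed to be the list of stems already produced by the analyzer; duplicates are removed as well, since the normalised versions are stored without repetitions.

```python
def normalise(stems):
    """Deduplicate the stems, sort them, and join them with a single space."""
    return " ".join(sorted(set(stems)))


print(normalise(["cancer", "risk"]))  # cancer risk
print(normalise(["risk", "cancer"]))  # cancer risk
print(normalise(["risk", "toxic"]))   # risk toxic
```

Two expressions now count as the same term exactly when their normalised strings are equal, which is a comparison a keyword field can do.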

The synonym index contains 2 fields (lists). The field exp holds the synonymous expressions themselves, and the field normalised holds their normalised versions (after the analyzer and sorting), without repetitions. It is in the second field that we look for matches to discover terms.
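The article does not show the actual index definition, so the following is only a guess at what a compatible one might look like, written as the Python dictionary that would be sent as the body of an index-creation request. The filter chain of no_synonyms_analyzer (lowercase, stop, porter_stem) and the use of keyword for both fields are assumptions; only the analyzer name and the field names exp and normalised come from the queries shown later in the text.

```python
import json

# Hypothetical index definition; the real one is not given in the article.
SYNONYM_INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Assumed chain: lowercase, drop stop words, then stem.
                "no_synonyms_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Both fields hold lists of exact strings, so keyword fits.
            "exp": {"type": "keyword"},
            "normalised": {"type": "keyword"},
        }
    },
}

print(json.dumps(SYNONYM_INDEX_BODY, indent=2))
```

In this sketch the analyzer is not attached to any field. It is only called through the _analyze API, and the sorted, joined result is what gets stored in (and searched against) the normalised keyword field.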

Splitting an expression into n-grams

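Starting from each word in turn (shifting to the right), we take the longest remaining run of words and then repeatedly remove 1 word from its end. This gives every combination of consecutive words.

As an example, I will use “use DVT PE patients cancer”.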

The result is:

“use dvt pe patients cancer”

“use dvt pe patients”

“use dvt pe”

“use dvt”

“use”

“dvt pe patients cancer”

“dvt pe patients”

“dvt pe”

“dvt”

“pe patients cancer”

“pe patients”

“pe”

“patients cancer”

“patients”

“cancer”
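The listing above can be reproduced with a short function. This is a minimal sketch with a hypothetical function name; it assumes the expression is lowercased and split on whitespace, and it emits the n-grams in the same order as the list.

```python
def word_ngrams(expression):
    """Return all runs of consecutive words, longest first for each start position."""
    words = expression.lower().split()
    n = len(words)
    return [
        " ".join(words[start:end])
        for start in range(n)          # shift the start to the right
        for end in range(n, start, -1)  # drop one word from the end each time
    ]


ngrams = word_ngrams("use DVT PE patients cancer")
print(len(ngrams))  # 15
for g in ngrams:
    print(g)
```

An expression of n words yields n(n+1)/2 n-grams, so the number of index lookups grows quadratically with the length of the additional words.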

Searching for synonyms in the index

As an example, let us check whether “malignant tumour risks” has synonyms.

// STEP 1: Analyse the words in the expression to extract their base forms

POST synonym-alias/_analyze
{"analyzer": "no_synonyms_analyzer", "text": ["malignant tumour risks"]}

We get: “malign”, “tumour”, “risk”

// STEP 2: Sort the roots and join them with a space character (“ ”)

We get “malign risk tumour”

// STEP 3: Look for a complete match in the normalised field

POST synonym-alias/_search?size=30&_source_includes=exp
{
  "query": {"term": {"normalised": {"value": "malign risk tumour"}}}
}

We get 1 result with the following list of synonyms of the searched expression:

["risks of the carcinogenesis",
 "risks of the malignancy",
 "cancer risks",
 "malignant neoplasm risks",
 "risks of the malignant tumours",
 "risks of the malignant neoplasms",
 "malignant tumour risks",
 "carcinogenesis risks",
 "risks of malignant neoplasms",
 "risks of malignant tumours",
 "risks of cancers",
 "risks of carcinogenesis",
 "risks of the cancers",
 "malignancy risks",
 "risks of malignancy"]
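To see why the lookup in step 3 cannot produce partial matches, the behaviour of a term query on a multi-valued keyword field can be imitated in memory. This is only a sketch: the function name is hypothetical, and the sample document is hand-written for the illustration, not taken from a real index.

```python
# Hand-written sample document; a real one would hold every synonym and its key.
SYNONYM_DOCS = [
    {
        "exp": ["cancer risks", "malignant tumour risks"],
        "normalised": ["cancer risk", "malign risk tumour"],
    },
]


def term_lookup(docs, field, value):
    """Imitate a term query: a document matches only if one stored value equals `value` exactly."""
    return [d for d in docs if value in d.get(field, [])]


hits = term_lookup(SYNONYM_DOCS, "normalised", "malign risk tumour")
print(hits[0]["exp"])
# ['cancer risks', 'malignant tumour risks']

print(term_lookup(SYNONYM_DOCS, "normalised", "risk"))
# []
```

The second lookup returns nothing because “risk” is only a part of the stored keys, which is exactly the protection against partial matches that the normalised field is meant to provide.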

Conclusion

Elasticsearch is a distributed search and analytics engine. It has built-in capabilities allowing the user to filter, analyse, transform and index a stream of unstructured text down to the single character level. Each (indexed) document is a collection of fields, which each have their own type. The user can index strings to both text and keyword fields. The text field values are analysed for full-text search while keyword strings are left as-is for filtering and sorting. Here I have shown how synonym searches can be done in the context of (logically indivisible) keywords using analysis and transformations available only for the text data type. This allows flexibility and search efficiency far beyond the level of a partial match or unwanted additional word combinations.

By Delcho Delov, Consultant at Estafet
