Full Text Search With Elasticsearch

Full text search seems to be an easy problem, but when taking a closer look we will find out that it is actually more complex. This blog post is here to guide you through full text search and the trade-offs you will have to make.

The basics: Analyzers, tokenizers and filters

Within Elasticsearch you will find terms like analyzers, tokenizers and filters all over the place. Let’s figure out what they are and what they mean.

Tokenizers

Let’s start with tokenizers. A tokenizer breaks text up into tokens. How this happens differs per tokenizer, and there are many tokenizers for Elasticsearch. See the Tokenizer reference on the Elasticsearch website.
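
A quick way to see a tokenizer in action is the _analyze API. The sketch below is Python using the requests library, assuming a local cluster on http://localhost:9200 without security enabled; it runs only the built-in standard tokenizer, with no filters.

```python
import requests

# Ask Elasticsearch to run just a tokenizer (no filters) over some text.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "standard", "text": "The quick brown fox"},
)
print([t["token"] for t in resp.json()["tokens"]])
# ['The', 'quick', 'brown', 'fox'] - note: no lowercasing, that is a filter's job
```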

Filters

A filter is a post-processing step that runs on the tokens generated by the tokenizer. For example, the lowercase filter will lowercase the text in all generated tokens. A list of filters can be found in the Token filter reference on the Elasticsearch website.
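
The same _analyze API accepts filters next to the tokenizer, which makes it easy to see what a filter adds. A minimal sketch, with the same local-cluster assumption as above:

```python
import requests

# Run the standard tokenizer, then apply the lowercase filter to its tokens.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "standard",
        "filter": ["lowercase"],
        "text": "The QUICK Brown Fox",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# ['the', 'quick', 'brown', 'fox']
```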

Analyzers

An analyzer is a combination of a tokenizer and filters. An analyzer can have multiple filters but always has exactly one tokenizer. You can easily create custom analyzers at field level by combining a tokenizer with one or more filters per field.
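
As a sketch of what that looks like: the index settings below declare a custom analyzer (the index name, analyzer name and the extra asciifolding filter are just illustrative choices) and the mapping attaches it to a single field.

```python
import requests

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # A custom analyzer: exactly one tokenizer, any number of filters.
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # The analyzer is assigned per field.
            "title": {"type": "text", "analyzer": "my_analyzer"}
        }
    },
}

requests.put("http://localhost:9200/my_index", json=index_body)
```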

The search process in a nutshell

The search process basically consists of two parts. The first part is the indexing phase. At index time the incoming text is processed by the tokenizer, and after that the filters do their work. The resulting tokens that come out of this process are stored and used for search.

The second part of the search process is the actual searching. Your search query goes through the search analyzer (so it is tokenized and filtered as well), and then Elasticsearch calculates a score and returns all documents where you have hits.

For full text search it is usually a requirement that you can search on parts of words. This requirement rules out almost all tokenizers. That is why we will mainly focus on two of them: ngram and edge ngram.

Ngram tokenizer

The ngram tokenizer splits text into words, and each word is then split up into tokens with a minimum size X and a maximum size Y. If a search term matches a full word or one of these tokens, there is a hit. This way we can support full text search. Let’s see how this works with an example:

Example:
Word: “quick”

Minimal token size: 2
Max token size: 10

Will generate tokens:
“qu”, “ui”, “ic”, “ck”, “qui”, “uic”, “ick”, “quic”, “uick”, “quick”

The resulting tokens are stored and we are able to search on those tokens.
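
A sketch of such a configuration is shown below. The index and analyzer names are arbitrary, and note that the index.max_ngram_diff setting (which defaults to 1) has to be raised to allow a gap of 8 between min_gram and max_gram.

```python
import requests

ngram_index = {
    "settings": {
        # min_gram 2 and max_gram 10 differ by 8, so raise the default limit of 1.
        "index": {"max_ngram_diff": 8},
        "analysis": {
            "tokenizer": {
                "my_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "my_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        },
    }
}

requests.put("http://localhost:9200/ngram_demo", json=ngram_index)

# Inspect the tokens that "quick" produces with this analyzer.
resp = requests.post(
    "http://localhost:9200/ngram_demo/_analyze",
    json={"analyzer": "my_ngram_analyzer", "text": "quick"},
)
print(sorted(t["token"] for t in resp.json()["tokens"]))
# ['ck', 'ic', 'ick', 'qu', 'qui', 'quic', 'quick', 'ui', 'uic', 'uick']
```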

Ngram versus Edge ngram

Ngram is not very efficient in terms of storage: in the example above a single 5-letter word already generates 10 tokens. It is also not very likely that people search on “ui” or “ick”. It can be more interesting to go for the edge_ngram tokenizer. With edge_ngram we still split words up into tokens, but only from the beginning of the word.

Example:
Word: “quick”

Minimal token size: 2
Max token size: 10
Will be tokenised into:

“qu”, “qui”, “quic”, “quick”

Now we only produce 4 tokens, which makes search faster, uses far less storage and is probably just as effective for the end user.
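
The edge_ngram variant only differs in the tokenizer type (and it does not need the max_ngram_diff override, which only applies to the ngram tokenizer). A comparable sketch, with the same assumed names and local cluster:

```python
import requests

edge_index = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_edge_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "my_edge_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_edge_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

requests.put("http://localhost:9200/edge_ngram_demo", json=edge_index)

resp = requests.post(
    "http://localhost:9200/edge_ngram_demo/_analyze",
    json={"analyzer": "my_edge_analyzer", "text": "quick"},
)
print(sorted(t["token"] for t in resp.json()["tokens"]))
# ['qu', 'qui', 'quic', 'quick']
```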

I explained earlier that the search process generally consists of two parts. When doing full text search it is important to know how this affects your search results. Usually you want to use the same analyser for both parts (this is a recommendation by Elasticsearch), but with ngrams you want to move away from this, and that makes perfect sense! Let’s take an example:

Example:
Line 1: “the quick brown fox”
Line 2: “my brother in law is awesome”

Minimal token size: 3
Max token size: 10

Will generate tokens:
“the“, “qui“, “quic”, “quick”, “bro”, “brow”, “brown”, “fox”

And

“bro“, “brot“, “broth”, “brothe”, “brother”,
“law”, “awe”, “awes”, “aweso”, “awesom”, “awesome”

Imagine you search for the term brown. If you use edge_ngram as the search analyser (the analyser used DURING search), then brown will be tokenised into:

“bro”, “brow”, “brown”

This will give hits for both sentences, because the token bro lives in both of them. However, this is most likely not something you want.

To prevent this behaviour we should change the search analyser to one that uses the keyword tokenizer (which basically means that we don’t split the search term into ngrams!). By doing this we only search for tokens that equal brown, and now only “the quick brown fox” will pop up.
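
A sketch of that setup: index with the edge_ngram analyzer, but search with an analyzer that keeps the whole query as one lowercased token. The names below, and the combination of the keyword tokenizer with a lowercase filter, are my own illustrative choices.

```python
import requests

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_edge_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Index time: split every word into edge ngrams.
                "my_edge_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_edge_tokenizer",
                    "filter": ["lowercase"],
                },
                # Search time: keep the query as a single lowercased token.
                "my_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "my_edge_analyzer",
                "search_analyzer": "my_search_analyzer",
            }
        }
    },
}

requests.put("http://localhost:9200/search_demo", json=index_body)

# Index the two example lines.
for line in ["the quick brown fox", "my brother in law is awesome"]:
    requests.post(
        "http://localhost:9200/search_demo/_doc?refresh=true",
        json={"body": line},
    )

# The query "brown" is no longer split into ngrams, so it only matches
# documents containing a token equal to "brown".
resp = requests.post(
    "http://localhost:9200/search_demo/_search",
    json={"query": {"match": {"body": "brown"}}},
)
print([hit["_source"]["body"] for hit in resp.json()["hits"]["hits"]])
# ['the quick brown fox']
```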

Token sizes

It is also good to take a look at, and play with, the token sizes. During search you are most likely not searching for one- or two-letter words. With the min_gram parameter you can set the minimum token size, which prevents words like “a”, “an”, “on” from being indexed. The max_gram parameter sets the maximum length of the tokens. A lower max_gram value reduces storage space but leads to less accurate results.

Example:
Words: “apples”, “application”, “appearance”

Minimal token size: 2
Max token size: 3

Will generate tokens:
“ap”, “app”
“ap”, “app”
“ap”, “app”

In this example it is impossible to get any unique results: you either find nothing or all 3 results. So it is important to find a nice trade-off here.
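
This is easy to verify with the _analyze API, here with an edge_ngram tokenizer defined inline for the call (a sketch, assuming the same local cluster as in the earlier examples).

```python
import requests

for word in ["apples", "application", "appearance"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={
            # An ad-hoc edge_ngram tokenizer, defined inline for this request.
            "tokenizer": {"type": "edge_ngram", "min_gram": 2, "max_gram": 3},
            "text": word,
        },
    )
    print(word, sorted(t["token"] for t in resp.json()["tokens"]))
# Every word collapses to the same tokens: ['ap', 'app']
```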