Elasticsearch

Elasticsearch is a distributed, full-text search engine built on Apache Lucene that stores documents as JSON. It is open source and implemented in Java.

I. Inverted Index

1. Introduction

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.
For example:
– “The quick brown fox jumped over the lazy dog”
– “Quick brown foxes leap over lazy dogs in summer”

The resulting inverted index (before any normalization):

Term      Doc_1  Doc_2
Quick       -      X
The         X      -
brown       X      X
dog         X      -
dogs        -      X
fox         X      -
foxes       -      X
in          -      X
jumped      X      -
lazy        X      X
leap        -      X
over        X      X
quick       X      -
summer      -      X
the         X      -

Now, if we want to search for “quick brown”, we just need to find the documents in which each term appears: both documents contain brown, but only Doc_1 contains quick. Doc_1 matches both terms, so it is a better match than Doc_2.
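The inverted index above can be sketched in a few lines of Python. Everything here (function and variable names) is illustrative, not part of the Elasticsearch API:

```python
# Build a naive inverted index: each unique term maps to the set of
# document ids it appears in. There is no normalization yet, so
# "Quick" and "quick" are distinct terms.
def build_inverted_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = [
    "The quick brown fox jumped over the lazy dog",
    "Quick brown foxes leap over lazy dogs in summer",
]
index = build_inverted_index(docs)

print(index["brown"])  # {0, 1} - both documents contain "brown"
print(index["quick"])  # {0}    - only the first document has lowercase "quick"
```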

2. Problem

There are a few problems with our current inverted index:
– “Quick” and “quick” appear as separate terms, while the user probably thinks of them as the same word.
– “fox” and “foxes” are pretty similar, as are “dog” and “dogs”. They share the same root word.
– “jumped” and “leap”, while not from the same root word, are similar in meaning. They are synonyms.

3. Solution

– “Quick” can be lower-cased to become “quick”.
– “foxes” can be stemmed (reduced to its root form) to become “fox”. Similarly, “dogs” could be stemmed to “dog”.
– “jumped” and “leap” are synonyms and can be indexed as just the single term “jump”.

The normalized inverted index now looks like this:

Term      Doc_1  Doc_2
brown       X      X
dog         X      X
fox         X      X
in          -      X
jump        X      X
lazy        X      X
over        X      X
quick       X      X
summer      -      X
the         X      X

Note: Searching for +Quick +foxes would still fail, because the exact terms Quick and foxes no longer exist in our index. The query string needs the same tokenization and normalization as the indexed text, so that it becomes +quick +fox and matches both documents.
=> Analysis & Analyzers are the solution.

II. Analysis & Analyzer

1. Analysis

Analysis is the process of:
– Tokenizing a block of text into individual terms
– Normalizing these terms into a standard form
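The lowercasing, stemming, and synonym steps described earlier can be sketched in a few lines of Python. The suffix-stripping "stemmer" and the synonym table are toy stand-ins for real stemming and synonym token filters:

```python
# Toy analysis pass: tokenize on whitespace, then normalize each term.
# The synonym table and suffix rules are hard-coded for this example only.
SYNONYMS = {"jumped": "jump", "leap": "jump"}

def normalize(term):
    term = term.lower()                 # Quick -> quick
    term = SYNONYMS.get(term, term)     # leap  -> jump
    if term.endswith("es"):
        term = term[:-2]                # foxes -> fox
    elif term.endswith("s"):
        term = term[:-1]                # dogs  -> dog
    return term

def analyze(text):
    return [normalize(t) for t in text.split()]

print(analyze("Quick brown foxes leap"))  # ['quick', 'brown', 'fox', 'jump']
```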

2. Analyzer
2.1 Structure

An analyzer is a wrapper that combines three functions into a single package:
Character Filters (optional) + Tokenizer + Token Filters (optional)

Character Filters: preprocess the stream of characters (adding, removing, or changing characters) before it is passed to the tokenizer.

For example:
+ Strip out HTML: <b>helpful</b> => helpful
+ Map: LOL => _laugh_
+ Replace with Pattern: Java-Sample-Approach => Java_Sample_Approach

>> More details at: Elasticsearch Character Filters (HTML Strip, Mapping, Pattern Replace)

Tokenizer:
+ receives the stream of characters, breaks it into individual tokens (usually individual words), and outputs a stream of tokens, for example by splitting whenever it encounters whitespace or punctuation.
+ is also responsible for recording the order or position of each term (used for phrase and word-proximity queries) and the start and end character offsets of the original word each term represents (used for highlighting search snippets).

>> More details at:
Word Oriented Tokenizers
Partial Word Tokenizers
Structured Text Tokenizers

Token Filters: receive the stream of tokens from the tokenizer and may modify tokens (e.g. Quick => quick), delete tokens (remove stopwords such as “a”, “and”, “the”), or add tokens (synonyms such as “jump” for “leap”).
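A minimal sketch of the whole pipeline, with toy regex-based stand-ins for an HTML-stripping character filter, a word tokenizer, and lowercase/stopword token filters (the regexes and stopword list are illustrative only):

```python
import re

STOPWORDS = {"a", "and", "the"}

def char_filter(text):
    # Stage 1: preprocess the character stream (strip HTML tags).
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Stage 2: break the character stream into individual word tokens.
    return re.findall(r"\w+", text)

def token_filters(tokens):
    # Stage 3: lowercase each token and drop stopwords.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>The</b> QUICK fox, and the dog"))  # ['quick', 'fox', 'dog']
```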

2.2 Basic Analyzers

keyword analyzer returns the entire input string as a single token.

whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

simple analyzer breaks text into lower cased terms whenever it encounters a character which is not a letter.

stop analyzer is like the simple analyzer, but also supports removing stop words (_english_ stop words by default).

standard analyzer is the default analyzer. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.
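Rough Python approximations of three of these analyzers make the differences concrete. These mimic, rather than reproduce, Elasticsearch’s behaviour; the sample string is arbitrary:

```python
import re

def keyword_analyzer(text):
    # Entire input string as a single token.
    return [text]

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are kept.
    return text.split()

def simple_analyzer(text):
    # Lowercase, and break on any character that is not a letter.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

sample = "Java-Sample Approach 2017"
print(keyword_analyzer(sample))     # ['Java-Sample Approach 2017']
print(whitespace_analyzer(sample))  # ['Java-Sample', 'Approach', '2017']
print(simple_analyzer(sample))      # ['java', 'sample', 'approach']
```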

>> More details (with configuration and examples) at: Elasticsearch Analyzers – Basic Analyzers

2.3 Custom Analyzers

A Custom Analyzer is a combination of:
character filters (optional) -> tokenizer -> token filters (optional)

Matching these components, it has the following parameters:
char_filter (optional): an array of built-in or customised character filters.
tokenizer (required): a built-in or customised tokenizer (a Word Oriented, Partial Word, or Structured Text tokenizer).
filter (optional): an array of built-in or customised token filters.
position_increment_gap (optional): when indexing an array of text values, Elasticsearch inserts a fake “gap” between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100.

For example, with the array "titles": [ "Java Sample Approach", "Java Technology" ], the “gap” between the term approach and the term java is position_increment_gap, so a phrase query for “approach java” will not match.
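As a sketch, a custom analyzer combining these parameters could be defined when creating an index like this (the index and analyzer names are illustrative; html_strip, standard, lowercase, and stop are built-in components):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}
```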

>> More details at: Elasticsearch Analyzers – Custom Analyzer

III. Mapping

Mapping defines how a document and the fields inside are stored and indexed.

For example, use mappings to define:
– date fields as dates, numeric fields as numbers
– string fields as a text field for full-text search, or as a keyword field for sorting or aggregations
– the format of date values
– custom rules for dynamically added fields.

Field datatypes

Each field has a data type which can be:
– simple types: text, keyword, date, long, double, boolean or ip.
– types that support the hierarchical nature of JSON: object or nested.
– specialised types such as geo_point, geo_shape, or completion.

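A mapping covering several of these datatypes might look like this (index and field names are illustrative; note that versions before Elasticsearch 7 nest properties under a mapping type name):

```json
PUT /my_index
{
  "mappings": {
    "properties": {
      "title":     { "type": "text" },
      "tags":      { "type": "keyword" },
      "post_date": { "type": "date", "format": "yyyy-MM-dd" },
      "views":     { "type": "long" },
      "location":  { "type": "geo_point" }
    }
  }
}
```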

IV. Structure of a Search Request/Response

Elasticsearch search requests are JSON document–based requests or URL-based requests. Both forms send the request to the server in the same format, so we should understand the important components we can change for each search request, and look at a typical response.

Search Scope:
All REST search requests use the _search endpoint and can be GET or POST requests. We can search the entire cluster or limit the scope by specifying the names of indices or types in the request URL.
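For example (index and type names are illustrative; type names in URLs only apply to versions before Elasticsearch 7):

```
GET /_search                 search the entire cluster
GET /my_index/_search        search one index
GET /index1,index2/_search   search several indices
GET /my_index/post/_search   search one type in an index (pre-7.x)
```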

Basic Components:
Once we indicate the indices to search, we need to configure some important components:

query configures the best documents to return based on a score, as well as the documents you don’t want to return (using the query DSL and the filter DSL).
size indicates the number of documents to return. Defaults to 10.
from is the index of the first document to return. Defaults to 0.
_source specifies how the _source field is returned.
sort: default sorting is based on the _score of a document. If we don’t care about the _score, adding a sort helps us control which documents get returned.
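Putting these components together, a request body might look like this (index and field names are illustrative):

```json
POST /my_index/_search
{
  "query": {
    "match": { "title": "spring boot" }
  },
  "size": 5,
  "from": 0,
  "_source": [ "title", "post_date" ],
  "sort": [ { "post_date": { "order": "desc" } } ]
}
```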

Request body–based: When we execute advanced searches, request body–based searches give us more flexibility and more options:
+ Pagination and selected fields
+ Wildcards in returned fields
+ Sort order

Structure of Response:
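A typical response has the shape below (the values are illustrative; from Elasticsearch 7 onwards, hits.total is an object rather than a plain number, and _type is removed):

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.3862944,
    "hits": [
      {
        "_index": "my_index",
        "_type": "post",
        "_id": "1",
        "_score": 1.3862944,
        "_source": { "title": "Spring Boot with Elasticsearch", "post_date": "2017-10-25" }
      }
    ]
  }
}
```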

>> More details at: ElasticSearch – Structure of a Search Request/Response

V. Query DSL

Elasticsearch provides a full Query DSL based on JSON to define queries.

1. Context

The behaviour of a query clause depends on whether it is used in query context or in filter context:

1.1 Query Context

The query clause answers the question:
“How well does this document match this query clause?”

>> We have 2 main requirements:
– whether or not the document matches
– how well the document matches, relative to other documents (this is what _score represents)

The order of the results depends on _score.

1.2 Filter context

The query clause answers the question:
“Does this document match this query clause?”

>> The response is just a simple Yes or No (without _score).

Frequently used filters will be cached automatically by Elasticsearch, to speed up performance. This context is mostly used for filtering structured data.

For example:
– Is post_date from “2017-10-25” onwards?
– Does tags contain “firebase”?

Notice that _score is constant.

*Note: Since Elasticsearch 5.x, the filtered query has been removed; the bool query (with its filter clause) replaces it.

We can mix the two types of context in a single search request.
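Mixing the two contexts can be sketched with a bool query: clauses under must run in query context and contribute to _score, while clauses under filter run in filter context and are simply cached yes/no checks (index and field names are illustrative):

```json
POST /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "firebase" } }
      ],
      "filter": [
        { "term":  { "tags": "firebase" } },
        { "range": { "post_date": { "gte": "2017-10-25" } } }
      ]
    }
  }
}
```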

>> More details at: ElasticSearch Filter vs Query

2. Term-level Query vs Full-text Query
2.1 Term-level Query

Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term.

Term Query for text vs keyword
– text: full text (body of an email for example)
– keyword: exact value (for example: email address or a zip code – numbers, dates, and keywords)

text values are analyzed: value -> analyzer -> list of terms -> inverted index.
keyword values are indexed as-is: value -> inverted index.

*Note: term query looks for the exact term in the field’s inverted index without knowing anything about the analyzer.
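For example (names illustrative), this term query looks up the exact term firebase in the inverted index of a keyword field. The same query with “Firebase” would find nothing, because no analyzer lowercases the query term:

```json
POST /my_index/_search
{
  "query": {
    "term": { "tags": "firebase" }
  }
}
```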

>> More details at:
Term & Terms Query
Range Query
Prefix Query & Wildcard Query
Regexp Query
Type & Ids Query
Fuzzy Query

2.2 Full-text Query

Queries like the match or query_string queries are high-level queries:
– If we use them to query a date or integer field, they will treat the query string as a date or integer.
– If we query an exact-value (not_analyzed) string field, they will treat the whole query string as a single term.
– If we query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document.
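For example (field name illustrative), this match query passes “Quick Foxes” through the field’s analyzer; with the standard analyzer that produces the terms quick and foxes, a low-level term query runs for each, and the scores are combined:

```json
POST /my_index/_search
{
  "query": {
    "match": { "title": "Quick Foxes" }
  }
}
```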

>> More details at:
Full Text Queries – Basic
Multi Match Query – Basic
Multi Match Query – More Practice
Simple Query String Query

3. Compound Queries

Compound queries wrap other compound or leaf queries to combine results and scores, to change behaviour, or to switch from query to filter context.
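For instance, a constant_score query (names illustrative) switches its wrapped query into filter context and gives every matching document the same fixed score:

```json
POST /my_index/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "tags": "firebase" } },
      "boost": 1.2
    }
  }
}
```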

You can find many types of compound query: Constant Score, Bool, Dis Max, Function Score and Boosting Query at:
Elasticsearch Compound Queries

VI. Integration

1. Angular 4

Quick Start – How to add Elasticsearch.js
Add Document to Index
Get All Documents in Index
Documents Pagination with Scroll
Simple Full Text Search

2. Spring Boot

How to start SpringBoot ElasticSearch using Spring Data