Elasticsearch

Elasticsearch is a distributed, full-text search engine built on Lucene that works with JSON documents. It is open source and implemented in Java.

I. Inverted Index
1. Introduction

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.
For example:
– “The quick brown fox jumped over the lazy dog”
– “Quick brown foxes leap over lazy dogs in summer”

[Figure: inverted index of the two documents]

Now, if we want to search for “quick brown”, we only need to look up each term in the index and find the documents in which it appears.

[Figure: looking up “quick” and “brown” in the inverted index]
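The idea can be sketched in a few lines of Python. This is only a toy illustration under simple assumptions (terms are split on whitespace and nothing is normalized); the function names `build_inverted_index` and `search` are made up for this example and are not part of Elasticsearch.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace split, no normalization
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, for each document, how many of the query terms it contains."""
    hits = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return dict(hits)

index = build_inverted_index(docs)
print(search(index, "quick brown"))
```

Document 1 contains both terms, while document 2 only matches “brown”, because “Quick” was indexed with a capital letter.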

2. Problem

There are a few problems with our current inverted index:
– “Quick” and “quick” appear as separate terms, while the user probably thinks of them as the same word.
– “fox” and “foxes” are pretty similar, as are “dog” and “dogs”; they share the same root word.
– “jumped” and “leap”, while not from the same root word, are similar in meaning. They are synonyms.

3. Solution

– “Quick” can be lower-cased to become “quick”.
– “foxes” can be stemmed (reduced to its root form) to become “fox”. Similarly, “dogs” can be stemmed to “dog”.
– “jumped” and “leap” are synonyms and can be indexed as just the single term “jump”.

[Figure: inverted index after normalizing the terms]

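Note: a search for +Quick +foxes would still fail, because the index no longer contains the exact terms “Quick” and “foxes”. The query string has to be normalized in the same way as the indexed text.
=> Analysis and analyzers solve this problem.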
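A small Python sketch makes the point concrete. It assumes toy lookup tables (`STEMS`, `SYNONYMS`) in place of a real stemmer and synonym filter, and all names are hypothetical.

```python
SYNONYMS = {"jumped": "jump", "leap": "jump"}
STEMS = {"foxes": "fox", "dogs": "dog"}  # toy lookup table, not a real stemmer

def normalize(term):
    """Lowercase the term, then apply the toy stem and synonym tables."""
    term = term.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)

def build_index(docs):
    """Build an inverted index of normalized terms."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(normalize(term), set()).add(doc_id)
    return index

def search_all(index, terms):
    """Return the documents that contain every term (like +term1 +term2)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_index(docs)

# The raw query terms are not in the normalized index, so nothing matches.
raw_hits = search_all(index, ["Quick", "foxes"])
# Normalizing the query terms the same way finds both documents.
normalized_hits = search_all(index, [normalize(t) for t in ["Quick", "foxes"]])
print(raw_hits, normalized_hits)
```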

II. Analysis & Analyzer
1. Analysis

Analysis is a process that consists of two steps:
– First, tokenizing a block of text into individual terms
– Then, normalizing these terms into a standard form

2. Analyzer

An analyzer is a wrapper that combines three functions into a single package:
– Character filters
The string is first passed through any character filters in turn. Their job is to tidy up the string before tokenization.
Example: strip out HTML, or convert & characters to the word “and”.

– Tokenizer
Next, the string is split into individual terms by a tokenizer.
Example: a simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

– Token filters
Finally, each term is passed through any token filters in turn. They can change terms, remove terms, or add terms.
Example: lowercasing “Quick”, removing stopwords such as “a”, “and”, and “the”, or adding synonyms such as “jump” and “leap”.
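The three stages can be imitated with plain Python functions. This is a simplified sketch rather than how Elasticsearch implements analyzers: the regular expressions are deliberately naive, and the function names and stopword set are assumptions made for the example.

```python
import re

STOPWORDS = {"a", "the"}  # small illustrative subset

def char_filter(text):
    """Tidy up the raw string: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", "", text)  # naive HTML stripping
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the string into terms on anything that is not a word character."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Lowercase every term, then drop stopwords."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text):
    """Run the full chain: character filters, tokenizer, token filters."""
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The Quick & brown fox</p>"))
```

The order matters: the character filter has to run before the tokenizer, otherwise the “&” would be discarded as punctuation instead of becoming the term “and”.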

3. Custom Analyzers

Structure of a custom analyzer:
PUT /my_index
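As a rough sketch, the body of such a request could look like the following, built here as a Python dictionary and printed as JSON. The names `amp_to_and`, `my_stopwords` and `my_analyzer` are hypothetical; `html_strip`, `mapping`, `stop`, `standard` and `lowercase` are built-in Elasticsearch components.

```python
import json

# Body for PUT /my_index: one custom analyzer made of a character filter
# chain, a tokenizer, and a token filter chain.
settings_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                # hypothetical name: replaces "&" with "and"
                "amp_to_and": {"type": "mapping", "mappings": ["& => and"]},
            },
            "filter": {
                # hypothetical name: custom stopword list
                "my_stopwords": {"type": "stop", "stopwords": ["a", "the"]},
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "amp_to_and"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stopwords"],
                },
            },
        }
    }
}

print(json.dumps(settings_body, indent=2))
```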

Example:

Query

Result

III. Mapping

In order to be able to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This information is contained in the mapping.

Core Simple Field Types
– String: string
– Whole number: byte, short, integer, long
– Floating-point: float, double
– Boolean: boolean
– Date: date
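For illustration, a mapping that uses one field of each kind might look like this, written as a Python dictionary. It assumes the older mapping style used throughout this post (a type name under `mappings` and the `string` field type); the type name `my_type` and all field names are invented for the example.

```python
import json

# Hypothetical mapping body for an index, one field per core type family.
mapping_body = {
    "mappings": {
        "my_type": {
            "properties": {
                "title": {"type": "string"},       # full-text string
                "views": {"type": "long"},         # whole number
                "rating": {"type": "float"},       # floating-point
                "published": {"type": "boolean"},  # boolean
                "created": {"type": "date"},       # date
            }
        }
    }
}

print(json.dumps(mapping_body, indent=2))
```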

IV. Term-Based Search

The term query finds documents that contain the exact term specified in the inverted index.

1. Why doesn’t the term query match my document?

– String fields can be analyzed (treated as full text, like the body of an email) or not_analyzed (treated as exact values, like an email address or a zip code). Exact values (like numbers, dates, and not_analyzed strings) have the exact value specified in the field added to the inverted index in order to make them searchable.

– The term query looks for the exact term in the field’s inverted index. This makes it useful for looking up values in not_analyzed string fields, or in numeric or date fields. On an analyzed field, the indexed terms have already been changed by the analyzer, so the original value may no longer be there to match.

Example
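The behaviour can be imitated with a toy Python model. It assumes a very simple analyzer (lowercase and split on whitespace), and the helper names are hypothetical.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def index_field(value, analyzed):
    """Return the set of terms that end up in the inverted index."""
    return set(analyze(value)) if analyzed else {value}

def term_query(indexed_terms, term):
    """A term query does no analysis: it looks for the exact term only."""
    return term in indexed_terms

analyzed_terms = index_field("Quick Foxes", analyzed=True)   # {"quick", "foxes"}
exact_terms = index_field("Quick Foxes", analyzed=False)     # {"Quick Foxes"}

print(term_query(analyzed_terms, "Quick Foxes"))  # no such term in the index
print(term_query(analyzed_terms, "quick"))        # matches an analyzed term
print(term_query(exact_terms, "Quick Foxes"))     # matches the exact value
```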

2. Finding Exact Value

To find exact values in a string field, the field must be mapped as not_analyzed.
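A sketch of what this could look like, again as Python dictionaries. The type name `products` and the field name `productID` are assumptions for the example, and the mapping uses the older `string` / `not_analyzed` style that this post describes.

```python
import json

# Hypothetical mapping: store productID as a single exact term.
mapping_body = {
    "mappings": {
        "products": {
            "properties": {
                "productID": {"type": "string", "index": "not_analyzed"},
            }
        }
    }
}

# A term query that looks for that exact value in the field.
term_query_body = {
    "query": {
        "term": {"productID": "XHDK-A-1293-#fJ3"},
    }
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(term_query_body, indent=2))
```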

V. Term-Based Versus Full-Text
1. Term-based queries

Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term.
It is important to remember that the term query looks in the inverted index for the exact term only.

2. Full-text queries

Queries like the match or query_string queries are high-level queries that understand the mapping of a field:
– If you use them to query a date or integer field, they will treat the query string as a date or integer.
– If you query an exact-value (not_analyzed) string field, they will treat the whole query string as a single term.
– If you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document.
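The decision described above can be sketched as a small Python function. This is a simplified model, not Elasticsearch internals: the mapping dictionaries, the toy `analyze` function and the name `match_terms` are all assumptions.

```python
from datetime import date

def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def match_terms(field_mapping, query_string):
    """Turn a query string into the list of terms to look up, based on the mapping."""
    field_type = field_mapping["type"]
    if field_type == "integer":
        return [int(query_string)]                 # treated as an integer
    if field_type == "date":
        return [date.fromisoformat(query_string)]  # treated as a date
    if field_mapping.get("index") == "not_analyzed":
        return [query_string]                      # whole string is one term
    return analyze(query_string)                   # full text: run the analyzer

print(match_terms({"type": "string"}, "Quick Brown"))
print(match_terms({"type": "string", "index": "not_analyzed"}, "Quick Brown"))
print(match_terms({"type": "integer"}, "42"))
```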

VI. Full Text Search
1. Match Query

Elasticsearch executes a match query (for example, a search for quick in the title field) as follows:
– Check the field type.
– Analyze the query string.
– Find matching docs.
– Score each doc.
To score each doc, the underlying term query calculates the relevance _score for each matching document by combining the term frequency (how often the term, for example quick, appears in the title field of each document), the inverse document frequency (how often the term appears in the title field across all documents in the index), and the length of each field (shorter fields are considered more relevant).
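A simplified Python sketch of how these three factors can be combined is shown below. The formulas follow the general shape of classic TF/IDF scoring and are an assumption for illustration; the real score includes further factors and should not be expected to match these numbers.

```python
import math

def score(term, field_tokens, all_fields):
    """Toy relevance score: term frequency * inverse document frequency * length norm."""
    freq = field_tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                                    # more occurrences, higher score
    doc_freq = sum(1 for tokens in all_fields if term in tokens)
    idf = 1 + math.log(len(all_fields) / (doc_freq + 1))    # rarer terms weigh more
    norm = 1 / math.sqrt(len(field_tokens))                 # shorter fields weigh more
    return tf * idf * norm

titles = [
    ["quick", "fox"],
    ["the", "quick", "brown", "fox", "jumped"],
    ["lazy", "dog"],
]
scores = [score("quick", tokens, titles) for tokens in titles]
print(scores)
```

The first title scores higher than the second because the same single occurrence of “quick” sits in a shorter field; the third title does not contain the term at all.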

2. Phrase Matching

Phrase matching works as follows:
– First analyzes the query string to produce a list of terms.
– Keeps only documents that contain all of the search terms, in the same positions relative to each other.

Example:
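A minimal Python sketch of position-based phrase matching, assuming a toy analyzer that lowercases and splits on whitespace. The function names are hypothetical.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def positions(tokens):
    """Map each term to the list of positions at which it occurs."""
    pos = {}
    for i, term in enumerate(tokens):
        pos.setdefault(term, []).append(i)
    return pos

def phrase_match(doc_text, phrase):
    """True if all phrase terms occur in the document at consecutive positions."""
    terms = analyze(phrase)
    if not terms:
        return False
    pos = positions(analyze(doc_text))
    if any(term not in pos for term in terms):
        return False  # the document must contain every term
    # The phrase starts at position p if term i occurs at position p + i.
    return any(
        all(start + i in pos[term] for i, term in enumerate(terms))
        for start in pos[terms[0]]
    )

doc = "The quick brown fox"
print(phrase_match(doc, "quick brown"))  # terms are adjacent and in order
print(phrase_match(doc, "brown quick"))  # same terms, wrong order
print(phrase_match(doc, "quick fox"))    # same terms, not adjacent
```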

VII. Partial Matching

Partial matching allows users to specify a portion of the term they are looking for and find any words that contain that fragment.

1. Prefix Query

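– A low-level query that works at the term level; it does not analyze the query string before searching.
– Prefix queries scale poorly and can put your cluster under a lot of strain, because every term that starts with the prefix has to be visited. Try to limit their impact on your cluster by using a long prefix, which reduces the number of terms visited.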
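The cost can be seen in a small Python sketch that scans a sorted term list, which is roughly how a prefix lookup over a term dictionary can be pictured. The term list and the function name `prefix_terms` are assumptions for the example.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """Collect every term that starts with the prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)  # jump to the first candidate
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

terms = sorted(["brown", "dog", "fox", "foxes", "quick", "quiet", "quit"])

print(prefix_terms(terms, "q"))     # short prefix: many terms visited
print(prefix_terms(terms, "quie"))  # longer prefix: fewer terms visited
```

Each matching term then needs its own lookup of matching documents, which is why a short prefix on a field with many unique terms is expensive.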

2. Wildcard and regexp Queries

– Low-level, term-based queries similar to the prefix query, but they match a pattern instead of just a prefix.
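The same term-scanning idea can be sketched in Python with `fnmatch` for wildcards and `re` for regular expressions. Note the assumption: Python's pattern syntax is used here only as a stand-in and is not identical to the pattern syntax Elasticsearch accepts.

```python
import re
from fnmatch import fnmatchcase

def wildcard_terms(terms, pattern):
    """Match terms against a shell-style pattern ('?' one char, '*' any run)."""
    return [t for t in terms if fnmatchcase(t, pattern)]

def regexp_terms(terms, pattern):
    """Match terms against a regular expression that must cover the whole term."""
    rx = re.compile(pattern)
    return [t for t in terms if rx.fullmatch(t)]

terms = ["fox", "foxes", "quick", "quiet"]

print(wildcard_terms(terms, "qui?t"))
print(wildcard_terms(terms, "fo*"))
print(regexp_terms(terms, r"qui[a-z]{2}"))
```

As with prefix queries, every candidate term has to be tested against the pattern, so patterns that start with a wildcard are especially costly.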

VIII. Filters & Caching

Filters can be a great candidate for caching.

Internal Filter Operation
– Find matching docs: the term filter looks up the term XHDK-A-1293-#fJ3 in the inverted index and retrieves the list of documents that contain that term.
– Build a bitset: the filter then builds a bitset, an array of 1s and 0s that describes which documents contain the term. Matching documents receive a 1 bit.
– Cache the bitset: the bitset is stored in memory, so that future executions of the same filter can skip steps 1 and 2. This makes filters very fast.
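The three steps can be sketched as a small Python class. It is a toy model under obvious assumptions: the bitset is a plain list of 0s and 1s, the index is a dictionary, and the class name `FilterCache` is made up.

```python
class FilterCache:
    """Toy model of a cached term filter."""

    def __init__(self, inverted_index, num_docs):
        self.index = inverted_index
        self.num_docs = num_docs
        self.cache = {}
        self.builds = 0  # how many times a bitset was actually built

    def bitset(self, term):
        if term in self.cache:
            return self.cache[term]                 # cached: skip steps 1 and 2
        doc_ids = self.index.get(term, set())       # step 1: find matching docs
        bits = [1 if doc_id in doc_ids else 0       # step 2: build the bitset
                for doc_id in range(self.num_docs)]
        self.builds += 1
        self.cache[term] = bits                     # step 3: cache the bitset
        return bits

index = {"XHDK-A-1293-#fJ3": {0, 2}}
cache = FilterCache(index, num_docs=4)

first = cache.bitset("XHDK-A-1293-#fJ3")
second = cache.bitset("XHDK-A-1293-#fJ3")  # served from the cache
print(first, cache.builds)
```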

When executing a filtered query, the filter is executed before the query.

IX. Source

elasticsearch-source

