Elasticsearch Tokenizers – Word Oriented Tokenizers

A tokenizer breaks a stream of characters up into individual tokens (characters, words…) and outputs a stream of tokens. The tokenizer is also responsible for recording the order or position of each term (used for phrase and word-proximity queries) and the start and end character offsets of the original word the term represents (used for highlighting search snippets).

In this tutorial, we're going to look at some Word Oriented Tokenizers, which tokenize full text into individual words.

1. Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

The _analyze API returns each token as a JSON object with its position and offsets. To keep things simple, we can write the terms from those tokens as a plain list.
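For example, here is a minimal _analyze request; the sample sentence is borrowed from the Elasticsearch reference documentation, and we will reuse it for the other tokenizers below:

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Term:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]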

Max Token Length

We can configure the maximum token length (max_token_length, which defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
For example, if we set max_token_length to 4, QUICK is split into QUIC and K.

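As a sketch, we can create an index with a custom standard tokenizer whose max_token_length is 4 (the names my_index, my_analyzer and my_tokenizer are just placeholders):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 4
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The QUICK brown foxes"
}

Term:

[ The, QUIC, K, brow, n, foxe, s ]

Every token longer than 4 characters is cut into 4-character chunks, so QUICK becomes QUIC and K.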

2. Whitespace Tokenizer

The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.

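Using the same sample sentence:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Term:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Note that Brown-Foxes stays together and bone. keeps its trailing period, because only whitespace separates terms.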

3. Letter Tokenizer

The letter tokenizer breaks text into terms whenever it encounters a character which is NOT a letter.
It works well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.

For example, dog's will be separated into dog and s:
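Again with the same sample sentence:

POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Term:

[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

Note that the digit 2 disappears entirely, since it is not a letter.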

4. Lowercase Tokenizer

The lowercase tokenizer works like the letter tokenizer, but it also lowercases all terms.

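With the same sample sentence:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Term:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]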

5. UAX URL Email Tokenizer

The uax_url_email tokenizer is like the standard tokenizer, except that it recognizes URLs and email addresses as single tokens.

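For example, with a text containing an email address (this sample also comes from the Elasticsearch reference documentation):

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}

Term:

[ Email, me, at, john.smith@global-international.com ]

The standard tokenizer would instead break the address apart into [ Email, me, at, john.smith, global, international.com ].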

Max Token Length

As with the standard tokenizer, we can configure the maximum token length (max_token_length, which defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.

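As a sketch, we can set max_token_length to 5 on a custom uax_url_email tokenizer (again, my_index, my_analyzer and my_tokenizer are placeholder names):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john.smith@global-international.com"
}

With such a small limit the email address no longer survives as a single token; it is broken up into chunks of at most 5 characters.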

6. Classic Tokenizer

The classic tokenizer is a good choice for English-language documents. It has heuristics for the special treatment of acronyms, company names, email addresses, and internet host names:

– It splits words at most punctuation characters and removes the punctuation. However, a dot that is not followed by whitespace is considered part of the token (jsa. com -> [ jsa, com ], but jsa.com -> [ jsa.com ])
– It splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split (Java-Sample -> [ Java, Sample ], but Java9-Sample -> [ Java9-Sample ])
– It recognizes email addresses and internet hostnames as single tokens (like uax_url_email)
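Running the same sample sentence through the classic tokenizer:

POST _analyze
{
  "tokenizer": "classic",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Term:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]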

However, these rules don't work well for most languages other than English.

Max Token Length

We can configure the maximum token length (max_token_length, which defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals, exactly as shown for the standard tokenizer above.

By JavaSampleApproach | November 9, 2017.
