xxestter

Reputation: 509

What is an Analyzer in Elasticsearch for?

I am having some issues understanding the Elasticsearch analyzer. What is it for, and how do I use it?

From this article, I understand that a tokenizer and token filters are applied to some source text. What I can't understand is where that source text comes from: the URL, or the text inside my indexes? The article says to execute

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

where the text is in the URL, but is this analyzer related to searching the text inside my indexes?

I am quite confused and sorry if my question sounds stupid.

Upvotes: 8

Views: 6855

Answers (2)

Rafiq

Reputation: 11465

Analyzer: An analyzer consists of three things: 1. character filters, 2. a tokenizer, and 3. token filters. An analyzer is basically a package of these building blocks, each of which changes the input stream. When indexing a document, the text goes through the following flow (a sketch tying all three together follows this list):

  1. First, one or more character filters can be added. A character filter receives a text field's original text and can transform the value by adding, removing, or changing characters. An example of this could be stripping out any HTML markup.

  2. Afterwards, a tokenizer splits the text into individual tokens, which will usually be words. So if we have a sentence with 10 words, we get an array of 10 tokens. An analyzer may only have one tokenizer. By default, a tokenizer named standard is used; it applies a Unicode text segmentation algorithm, which basically splits by whitespace and also removes most symbols such as commas, periods, and semicolons. That's because most symbols are not useful when it comes to searching; they are intended for human readers. Besides splitting text into tokens, the tokenizer is also responsible for recording the position of each token, including the start and end character offsets of the word it represents. This makes it possible to map tokens back to the original words, which is used to highlight matching words; the token positions are used when performing phrase searches and proximity searches.

  3. After splitting the text into tokens, it runs through one or more token filters. A token filter may add, remove, or change tokens. This is similar to a character filter, but token filters work with the token stream instead of a character stream. There are a couple of different token filters, the simplest being the lowercase token filter, which just converts all characters to lowercase. Another token filter that can be useful in many cases is stop, which removes common words, referred to as stop words. Another very useful token filter is synonym, which gives similar words the same meaning.
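Tying all three together, here is a minimal sketch of an index definition that wires one character filter, one tokenizer, and three token filters into a custom analyzer; the index name my_index, the analyzer name my_analyzer, and the synonym pair are made-up placeholders:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["big, large"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],                  --> 1. strip HTML markup
          "tokenizer": "standard",                        --> 2. split into tokens
          "filter": ["lowercase", "stop", "my_synonyms"]  --> 3. token filters, applied in order
        }
      }
    }
  }
}

Any text field mapped with this analyzer is run through the pipeline in that order at index time.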


Upvotes: 2

TechnocratSid

Reputation: 2415

An analyzer is a wrapper around three functions:

  • Character filter: Mainly used to strip out some unused characters or change some characters.
  • Tokenizer: Breaks text into individual tokens (or words) based on certain factors (whitespace, ngram, etc.).
  • Token filter: Receives the tokens and then applies some filters (for example, changing uppercase terms to lowercase).
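
Each of these building blocks can be exercised on its own with the Analyze API described below; a minimal sketch, where the HTML snippet is just a made-up input:

GET _analyze
{
  "char_filter": ["html_strip"],   --> drop the <p> tags
  "tokenizer": "standard",         --> split on whitespace, drop the comma
  "filter": ["lowercase"],         --> lowercase every token
  "text": "<p>I SING, he sings</p>"
}

This should return the tokens i, sing, he and sings.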

In a nutshell, an analyzer tells Elasticsearch how text should be indexed and searched.

And what you're looking at is the Analyze API, which is a very nice tool for understanding how analyzers work. The text is provided directly to this API and is not related to any index.

In your case the GET request:

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

is equivalent to:

GET _analyze
{
  "analyzer" : "snowball",
  "text" : "I sing he sings they are singing"
}

which outputs:

{
  "tokens": [
    {"token": "i", "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he", "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}

as mentioned in the article.

One more thing: let's say you have defined a custom analyzer in your index that does character filtering, tokenizing, and token filtering in your own way, and you want to check how it will tokenize some text. You can use the _analyze endpoint with your index name, and even in that case you have to provide the text:

GET my_index/_analyze
{
  "analyzer" : "custom",
  "text" : "I sing he sings they are singing" --> You have to provide the text. 
}
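
For completeness, here is a minimal sketch of how such an analyzer could be declared when creating the index; the analyzer name custom and the particular building blocks chosen are just placeholders:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "tokenizer": "whitespace",   --> split on whitespace only
          "filter": ["lowercase"]      --> then lowercase each token
        }
      }
    }
  }
}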

Why analyzers?

Analyzers are generally used when you want to index a text or phrase: breaking the text into words lets you search on individual terms and still get the document back.

Example: Let's say you have an index (my_index) with a text field (intro), and you index a document where "intro": "Hi there I am sid". If you're not using an analyzer, this will be stored as "Hi there I am sid", and to query for this document you will have to write the complete phrase (find documents where intro = "Hi there I am sid"). But if the phrase is indexed as tokens, then even if you query for a single token (find documents where intro = "sid") you'll get the document.
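
As a minimal sketch of that example (using the made-up index, field, and document from the paragraph above, and assuming a recent Elasticsearch version):

PUT my_index/_doc/1
{
  "intro": "Hi there I am sid"
}

GET my_index/_search
{
  "query": {
    "match": { "intro": "sid" }   --> a single token is enough to find the document
  }
}

The match query analyzes the query string with the same analyzer used at index time, so the token sid matches one of the stored tokens hi, there, i, am, sid.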

Note: By default, the standard analyzer is used for all text fields.

Hope it helps!

Upvotes: 21
