Sasha

Reputation: 43

How to tokenize and search with special characters in Elasticsearch

I need texts like #tag1 quick brown fox #tag2 to be tokenized into #tag1, quick, brown, fox, #tag2, so that I can search this text on any of the patterns #tag1, quick, brown, fox, #tag2, where the symbol # must be included in the search term. In my index mapping I have a text type field (to search on quick, brown, fox) with a keyword type subfield (to search on #tag1), but when I use the search term #tag1 I only get a match on a document whose whole text is exactly #tag1, not on #tag1 quick brown fox #tag2. I think what I need is a tokenizer that will produce word-boundary tokens that include special characters. Can someone suggest a solution?
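Roughly, my mapping looks like this (index and field names here are simplified placeholders, not my real ones):

PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}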

Upvotes: 1

Views: 3779

Answers (2)

Sasha

Reputation: 43

Thanks to @Kaveh's suggestion, I found my mistake. My custom analyzer (with lots of filters, etc.) was using the standard tokenizer, which I had assumed behaved like the whitespace tokenizer. Once I switched my custom analyzer to the whitespace tokenizer, it no longer strips # from the beginning of words, and I can search on patterns starting with # using the simple_query_string query type.
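A sketch of the working setup (index, analyzer, and field names are simplified, and the additional filters from my real analyzer are omitted):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

And the search:

GET my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "#tag1",
      "fields": [ "content" ]
    }
  }
}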

Upvotes: 1

Kaveh

Reputation: 1310

If you want to include # in your search, you should use a different analyzer than the standard analyzer, because # is removed during the analysis phase. You can use the whitespace analyzer to analyze your text field. For search you can also use a wildcard pattern:

Query:

GET [Your index name]/_search
{
  "query": {
    "wildcard": {
      "[FieldName]": "#tag*"
    }
  }
}

You can find information about Elasticsearch's built-in analyzers here.

UPDATE:

Whitespace analyzer:

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "#tag1 quick #tag2"
}

Result:

{
  "tokens" : [
    {
      "token" : "#tag1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "#tag2",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    }
  ]
}

As you can see, #tag1 and #tag2 come through as single tokens with the # intact.
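For contrast, running the same text through the standard analyzer strips the #:

POST /_analyze
{
  "analyzer": "standard",
  "text": "#tag1 quick #tag2"
}

This returns the tokens tag1, quick and tag2, which is why a search for #tag1 can never match a field analyzed with the standard analyzer.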

Upvotes: 1
