Simple elasticsearch regexp

Question

I'm trying to write a query that will give me all the documents where the field "id" is of the form: "SOMETHING-SOMETHING-4SOMETHING-SOMETHING-SOMETHING"

For instance, ab-ba-4a-b-a is a valid id.

I wrote this query

  "query": {
    "regexp": {
      "id": {
        "value": ".*-.*-4.*-.*-.*"
      }
    }
  }

It gets no hits. What's wrong with this? I can see many ids of this form.

Kamal Kunjapur · Accepted Answer

If the id field is of type keyword, the regexp query should work fine.

However, if it is of type text, notice how Elasticsearch stores the tokens internally.

POST /_analyze
{
  "text": "abc-abc-4bc-abc-abc",
  "analyzer": "standard"
}

Response:

{
  "tokens" : [
    {
      "token" : "abc",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "abc",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "4bc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "abc",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

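Notice that the analyzer breaks the value abc-abc-4bc-abc-abc into 5 tokens, none of which contains a hyphen. A regexp query is matched against each indexed term on its own and has to match the whole term, so .*-.*-4.*-.*-.* can never match on this field. Take a look at what Analysis and Analyzers are and how they are only applied on text fields.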
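To see why the original query finds nothing on a text field, the behaviour can be imitated outside Elasticsearch. The sketch below is only an approximation: it assumes the standard analyzer can be modelled by lowercasing and splitting on non-alphanumeric characters, and it uses Python's re.fullmatch to stand in for the regexp query, which has to match an entire indexed term. The helper names rough_standard_tokens and matches_any_term are made up for this illustration.

```python
import re

PATTERN = ".*-.*-4.*-.*-.*"

def rough_standard_tokens(text):
    """Rough stand-in for the standard analyzer (assumption): lowercase,
    then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def matches_any_term(pattern, terms):
    """A regexp query must match a whole term, so fullmatch each term."""
    return any(re.fullmatch(pattern, term) for term in terms)

doc_id = "abc-abc-4bc-abc-abc"

text_terms = rough_standard_tokens(doc_id)  # roughly what a text field indexes
keyword_terms = [doc_id]                    # what a keyword field indexes

print(text_terms)
print(matches_any_term(PATTERN, text_terms))     # no single token has a hyphen
print(matches_any_term(PATTERN, keyword_terms))  # the whole id is one term
```

The hyphens the pattern relies on only survive when the id is indexed as one term, which is the case for keyword fields.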

The keyword datatype, on the other hand, exists for the cases where you do not want your text to be analyzed (i.e. broken into tokens); it indexes the string value as a single term, exactly as it is.

If your mapping is dynamic, ES by default creates two fields for a string value: a text field and its keyword sibling, something like below:

{
  "mappings" : {
    "properties" : {
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}
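If you are not sure which case applies to your index, the decision can be written down as a small check over the mapping. This is a sketch, not an official client call: regexp_field_for is a hypothetical helper, and it assumes the mapping has already been fetched and has the shape shown above (the real response to a mapping request nests this under the index name).

```python
def regexp_field_for(mapping, field):
    """Return the field name a regexp query should target, or None if the
    field has no keyword form. Assumes the mapping shape shown above."""
    props = mapping.get("mappings", {}).get("properties", {})
    spec = props.get(field)
    if spec is None:
        return None
    if spec.get("type") == "keyword":
        return field  # already indexed as a single term
    # Look for a keyword sub-field, e.g. the default "keyword" sibling.
    for sub_name, sub_spec in spec.get("fields", {}).items():
        if sub_spec.get("type") == "keyword":
            return f"{field}.{sub_name}"
    return None

dynamic_mapping = {
    "mappings": {
        "properties": {
            "id": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
        }
    }
}

print(regexp_field_for(dynamic_mapping, "id"))
```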

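In that case, just apply the query you have on the id.keyword field. Keep in mind that with ignore_above set to 256, values longer than 256 characters are not indexed in id.keyword and so cannot be matched there.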

POST /_search
{
  "query": {
    "regexp": {
      "id.keyword": ".*-.*-4.*-.*-.*"
    }
  }
}
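If you build the request from code rather than typing it into a console, the body is just a nested dictionary. The sketch below uses the long form with "value", as in the question; the shorthand used in the query above is equivalent. regexp_query is a hypothetical helper, and sending the body is left out because the client and index name are not given in this thread.

```python
import json

def regexp_query(field, pattern):
    """Build a search body with a regexp query on the given field."""
    return {"query": {"regexp": {field: {"value": pattern}}}}

body = regexp_query("id.keyword", ".*-.*-4.*-.*-.*")

# Serialize to JSON, ready to be sent as the body of a _search request.
print(json.dumps(body, indent=2))
```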

Hope that helps!
