Exorcismus

Reputation: 2482

Elasticsearch analyzer is working in API but not working in search query

I have created an analyzer and set it in settings and mapping

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym_analyzer": {
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase"
                        ]
                    },
                    "regex_analyzer": {
                        "tokenizer": "regex_tokenizer",
                        "filter": [
                            "lowercase"
                        ]
                    }
                },
                "tokenizer": {
                    "regex_tokenizer": {
                        "type": "pattern",
                        "pattern": "((\\b|\\s|\\.|,)[a-z](\\b|\\s |\\.|,)){3,}",
                        "group": 0
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "transcript_data": {
                "properties": {
                    "transcript": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword"
                            },
                            "regex": {
                                "type": "text",
                                "analyzer": "regex_analyzer",
                                "search_analyzer": "regex_analyzer"
                            }
                        }
                    }
                }
            }
        }
    }
}

It works if I test it by calling the _analyze API directly, and it displays the correct tokens in an array:

POST myIndex/_analyze
{
  "analyzer": "regex_analyzer",
  "text": " this article is talking about l a z r and b k k t ...."
}

RESPONSE

{
  "tokens" : [
    {
      "token" : " b k k t",
      "start_offset" : 7971,
      "end_offset" : 7979,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " l a z r",
      "start_offset" : 8350,
      "end_offset" : 8358,
      "type" : "word",
      "position" : 1
    }
  ]
}

But if I query the index using the query below, it just returns an array containing the whole text in the fields attribute:

GET myIndex/_search
{
  "query": {
   "match_all": {

   }
  },
  "fields": [
    "transcript_data.transcript.regex"
  ]
}

RESPONSE

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "tickers",
        "_type" : "_doc",
        "_id" : "46",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "doc_type" : "post",
          "transcript_data" : {
            "transcript" : "this article is talking about l a z r and b k k t ....",
          },
          "join_field" : {
            "name" : "video",
            "parent" : "anonymouse"
          }
        },
        "fields" : {
          "transcript_data.transcript.regex" : [
            " this article is talking about l a z r and b k k t ...."
          ]
        }
      }
    ]
  }
}

I was expecting the "transcript_data.transcript.regex" array to be the same as the one returned by the _analyze API.

Upvotes: 1

Views: 56

Answers (1)

rabbitbr

Reputation: 3261

With script_fields you can retrieve the values indexed in the regex field; however, this can cause high memory consumption if your index is large. Note that you will have to enable fielddata on the field.

GET myIndex/_search
{
  "script_fields": {
    "my_doubled_field": {
      "script": {
        "source": "doc['transcript_data.transcript.regex']"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
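Enabling fielddata for the regex sub-field could look something like the mapping update below (a sketch, assuming the index name myIndex and the sub-field layout from the question; updating fielddata on an existing text field is allowed without reindexing):

PUT myIndex/_mapping
{
  "properties": {
    "transcript_data": {
      "properties": {
        "transcript": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            },
            "regex": {
              "type": "text",
              "analyzer": "regex_analyzer",
              "search_analyzer": "regex_analyzer",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}

Fielddata loads all the field's terms into heap memory at search time, which is why it is disabled by default on text fields and why this approach gets expensive as the index grows.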

Upvotes: 1
