Reputation: 2482
I have created a custom analyzer and referenced it in the index settings and mapping:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "regex_analyzer": {
            "tokenizer": "regex_tokenizer",
            "filter": [
              "lowercase"
            ]
          }
        },
        "tokenizer": {
          "regex_tokenizer": {
            "type": "pattern",
            "pattern": "((\\b|\\s|\\.|,)[a-z](\\b|\\s |\\.|,)){3,}",
            "group": 0
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "transcript_data": {
        "properties": {
          "transcript": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              },
              "regex": {
                "type": "text",
                "analyzer": "regex_analyzer",
                "search_analyzer": "regex_analyzer"
              }
            }
          }
        }
      }
    }
  }
}
It works when I test it by calling the _analyze API directly; the response lists the expected tokens in an array:
POST myIndex/_analyze
{
  "analyzer": "regex_analyzer",
  "text": " this article is talking about l a z r and b k k t ...."
}
RESPONSE
{
  "tokens" : [
    {
      "token" : " b k k t",
      "start_offset" : 7971,
      "end_offset" : 7979,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " l a z r",
      "start_offset" : 8350,
      "end_offset" : 8358,
      "type" : "word",
      "position" : 1
    }
  ]
}
But when I query the index with the query below, the fields attribute in the response just contains an array with the whole text:
GET myIndex/_search
{
  "query": {
    "match_all": {}
  },
  "fields": [
    "transcript_data.transcript.regex"
  ]
}
RESPONSE
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "tickers",
        "_type" : "_doc",
        "_id" : "46",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "doc_type" : "post",
          "transcript_data" : {
            "transcript" : "this article is talking about l a z r and b k k t ...."
          },
          "join_field" : {
            "name" : "video",
            "parent" : "anonymouse"
          }
        },
        "fields" : {
          "transcript_data.transcript.regex" : [
            " this article is talking about l a z r and b k k t ...."
          ]
        }
      }
    ]
  }
}
I was expecting the "transcript_data.transcript.regex" array to be the same as the token array returned by the _analyze API.
Upvotes: 1
Views: 56
Reputation: 3261
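The fields option of a search request returns the values as they appear in _source, not the tokens produced by the analyzer, which is why you get the whole text back. With script_fields you can read the terms that were actually indexed for the regex sub-field. Keep in mind that you have to enable fielddata on that field first, and that fielddata is loaded into heap memory, so memory consumption can be high if your index is large.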
{
  "script_fields": {
    "my_doubled_field": {
      "script": {
        "source": "doc['transcript_data.transcript.regex']"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
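Enabling fielddata could look roughly like the mapping update below. This is only a sketch: it assumes the index and field names from the question and a 7.x cluster where the mapping API takes no type name. The multi-field definition is repeated in full, and the only change is the added "fielddata": true on the regex sub-field.

```
PUT myIndex/_mapping
{
  "properties": {
    "transcript_data": {
      "properties": {
        "transcript": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            },
            "regex": {
              "type": "text",
              "analyzer": "regex_analyzer",
              "search_analyzer": "regex_analyzer",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}
```

Fielddata is built in heap the first time the field is accessed, so it may be worth trying this on a small test index before enabling it on a large one.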
Upvotes: 1