Discombobulous

Reputation: 1184

Obtain number of matches in a single field from elasticsearch

I want to obtain the number of times a term matches within each of my hits, along with the search results (e.g., I want to know that "hello" appeared 2 times in "hello hi hello").

However, my problem is even trickier because I want to use a soundex-style phonetic filter. For example, if I search for "great" and it matches "test test grate that great", I want to know that my term matched 2 times, because "great" is phonetically identical to "grate".

Here is what my index looks like:

{
    "lecture" : {  
        "properties" : {  
            "transcript" : {  
                "type" : "string",
                "analyzer" : "lecture_analyzer"
             },
            "file_id" : {
                "type" : "string"
            }
        }
    }
}

The lecture_analyzer looks like this:

{
    "tokenizer": "standard",
    "filter": [
        "dbl_metaphone"
    ]
}

dbl_metaphone is the token filter I use for phonetic matching.

Now when I issue the following query:

"query" : {
    "bool" : {
         "must" : [
              {"match": { "transcript" :"grate"}},
              {"term": { "file_id" : "21648371" }}
         ]
     }
}

I get the following result:

{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 3.519093,
    "hits" : [ {
      ...
      "_id" : "21648371",
      "_score" : 3.519093,
      "_source" : {
        "transcript" : "ok that's great, grate that carrot please",
        "file_id" : "21648371"
      }
    } ]
  }
}

However, I want to know that my term "grate" appeared twice in my hit: once for "grate", and once for "great" due to the dbl_metaphone filter I used.

Does anyone know how to do this?

Upvotes: 0

Views: 1566

Answers (1)

ymonad

Reputation: 12100

This is not so tricky; it is a simple term-counting problem. First, you have to check how "great" and "grate" are translated to terms by double metaphone.

$ curl -s 'http://<HOST>/<INDEX>/_analyze?analyzer=lecture_analyzer&pretty' \
-d "ok that's great, grate that carrot please" \
| grep '"token"'

You get

"token" : "AK",
"token" : "0TS",
"token" : "TTS",
"token" : "KRT",
"token" : "KRT",
"token" : "0T",
"token" : "TT",
"token" : "KRT",
"token" : "PLS",

So you can see that "great", "grate", and also "carrot" are all encoded as "KRT" by double metaphone, which means the count for this document will be 3 rather than 2.
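
The token list above can be tallied outside Elasticsearch as a quick sanity check. This is only a minimal sketch in Python, assuming the _analyze response has been parsed into a dict with a "tokens" list of objects that each carry a "token" key; the helper name count_tokens and the sample data are hypothetical.

```python
from collections import Counter


def count_tokens(analyze_response):
    """Tally how often each analyzed token occurs in a parsed _analyze response."""
    # Assumption: the response looks like {"tokens": [{"token": "AK", ...}, ...]}.
    return Counter(entry["token"] for entry in analyze_response.get("tokens", []))


# Hypothetical sample mirroring the token list shown above (other fields omitted).
sample = {
    "tokens": [
        {"token": t}
        for t in ["AK", "0TS", "TTS", "KRT", "KRT", "0T", "TT", "KRT", "PLS"]
    ]
}

counts = count_tokens(sample)
print(counts["KRT"])
```

Counting the encoded token rather than the raw word is what makes the phonetic variants collapse into a single number.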

Next, counting the number of matches within a document is not an easy problem for Elasticsearch.

One method is to use a script field. When using Groovy as the script language, you can get the term frequency with _index[FIELD][TERM].tf().

Note that you have to set script.engine.groovy.inline.search: on in elasticsearch.yml to enable inline Groovy scripting for searches.

Reference: Scripting, Advanced Scripting

Here's the actual query.

{
  "query": {
    ... WRITE_YOUR_QUERY_HERE ...
  },
  "script_fields": {
    "transcript_count": {
      "script": "_index['transcript']['KRT'].tf()"
    }
  },
  "_source": ["*"]
}

You get something like

"fields":{
  "transcript_count":[3]
}

The second method, which does not use Groovy, is to use term vectors.

The details are covered in the following question, so I will omit them here: How can I get total count of each words in elasticsearch document?
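
For illustration, once a term vectors response has been fetched, reading the count out of it could look like the following. This is a minimal sketch in Python, assuming the response has been parsed into a dict nested as term_vectors, then field name, then terms, then the term itself with a term_freq value; the helper term_freq and the sample response are hypothetical. The term passed in must be the analyzed token ("KRT"), not the raw word.

```python
def term_freq(termvectors_response, field, term):
    """Return term_freq for an analyzed term in one field, or 0 if it is absent."""
    # Assumption: the response nests as term_vectors -> field -> terms -> term.
    terms = (
        termvectors_response
        .get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    return terms.get(term, {}).get("term_freq", 0)


# Hypothetical, trimmed-down response for the document in the question.
sample = {
    "term_vectors": {
        "transcript": {
            "terms": {
                "KRT": {"term_freq": 3},
                "PLS": {"term_freq": 1},
            }
        }
    }
}

print(term_freq(sample, "transcript", "KRT"))
```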

The third method is to use the Explain API. Since you are searching with {"match": { "transcript" :"grate"}}, you can get something like the following from the Explain API in the search result.

"description" : "tf(freq=3.0), with freq of:",
"details" : [ {
"value" : 3.0,
"description" : "termFreq=3.0",
"details" : [ ]
} ]

It shows the details of how the score is calculated, and the term frequency of "KRT" is displayed among the intermediate results.
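
Pulling that number out of the explanation means walking the nested "details" lists. A minimal sketch in Python, assuming the explanation has been parsed into nested dicts with "description", "value", and "details" keys, as in the fragment above; the helper find_term_freqs is hypothetical. Matching on the "termFreq=" prefix is also an assumption, since the description text can differ between Elasticsearch versions.

```python
def find_term_freqs(explanation):
    """Collect the values of all 'termFreq=...' nodes in an explanation tree."""
    found = []
    # Assumption: term-frequency nodes have a description starting with "termFreq=".
    if explanation.get("description", "").startswith("termFreq="):
        found.append(explanation["value"])
    # Recurse into child nodes, if any.
    for child in explanation.get("details", []):
        found.extend(find_term_freqs(child))
    return found


# The explanation fragment shown above, as a Python dict.
sample = {
    "description": "tf(freq=3.0), with freq of:",
    "details": [
        {"value": 3.0, "description": "termFreq=3.0", "details": []}
    ],
}

print(find_term_freqs(sample))
```

A query that matches several terms can yield several termFreq nodes, which is why the sketch returns a list rather than a single number.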

The first method is not ideal for security reasons. The second and third are not ideal for performance. To satisfy both security and performance, you may need to write your own plugin. See Native Java Scripts.

Upvotes: 1
