user3270454
user3270454

Reputation:

Elasticsearch - Count of matches per document

I'm using this query to search a field for occurrences of phrases.

"query": {
    "match_phrase": {
       "content": "my test phrase"
  }
 }

I need to calculate how many matches occurred for each phrase per document (if this is even possible?)

I've considered aggregators but think these don't meet the requirements as these will give me the number of matches over the whole index not per document.

Thanks.

Upvotes: 9

Views: 5158

Answers (2)

Polynomial Proton
Polynomial Proton

Reputation: 5135

This can be achieved by using Script Fields /painless script.

You can count the number of occurrences per field and add it up for the document.

Example:

## Here's my test index with some sample values

POST t1/doc/1  <-- this has one occurence
{
  "content" : "my test phrase"
}

POST t1/doc/2    <-- this document has 5 occurences
{
   "content": "my test phrase ",
   "content1" : "this is my test phrase 1",
   "content2" : "this is my test phrase 2",
   "content3" : "this is my test phrase 3",
   "content4" : "this is my test phrase 4"

}

POST t1/doc/3
{
  "content" : "my test new phrase"
}

Now using the script I can count the phrase match for each field. I'm counting it once per field, but you can modify script to multi match per field.

Obviously, the Drawback here is that you need to mention each and every field from the document in the script, unless there's a way to loop through doc field that i am not aware of.

POST t1/_search
{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": """
                             int count = 0;

                            if(doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase)!=-1) count++;
                            if(doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase)!=-1) count++;
                            if(doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase)!=-1) count++;
                            if(doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase)!=-1) count++;
                            if(doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase)!=-1) count++;

                            return count;
""",
        "params": {
          "phrase": "my test phrase"
        }
      }
    }
  }
}

This will give me the phrase count per document as a scripted field

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            5                 <--- count of occurrences of the phrase in the document
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            1
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            0
          ]
        }
      }
    ]
  }
}

Upvotes: 7

Abdullah Ahsan
Abdullah Ahsan

Reputation: 107

You can use Term Vectors to achieve this functionality. Please have a look Term Vectors

Upvotes: -1

Related Questions