Elasticsearch: Count terms in document

Question

I'm fairly new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.

I have been able to perform a simple query that returns all documents that contain the word "cars" in their content (using Python):

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}}, 
    "from": 0, "size": 100})

Result looks something like this:

{'took': 2521, 
'timed_out': False, 
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index': 
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571, 
'_source': {'content': '....'}}]}}

The "_id"s are referring to a domain, so I basically get back:

abc.com
def.com
jkl.com

But I now want to know how often the search term ("cars") occurs in each document, like:

abc.com: 2
def.com: 1
jkl.com: 2

I found several solutions for obtaining the number of documents that contain the search term, but none that show how to count the occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not recognizing it as the solution to my problem.

Update:

As suggested by @Curious_MInd, I tried a terms aggregation:

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content" 
}}}})

Result:

{'took': 729, 'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 48, 'max_score': 1.0, 'hits': [{'_index': 'pages',
'_type': 'pages', '_id': '17252', '_score': 1.0,
'_source': {'content': '...'}}]},
'aggregations': {'cars_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}

I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: the results found by the terms aggregation are significantly worse than those from the multi_match query. Is there any way to combine the two?

Nishant · Accepted Answer

What you are trying to achieve can't be done in a single query. The first query filters the documents and returns the IDs of those for which term counts are required. Let's assume you have the following mapping:

{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
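
For readers working from Python, as in the question, creating an index with this mapping might look like the sketch below. It assumes `es` is an elasticsearch-py 6.x client instance and reuses the index name test and the type _doc from the mapping above; `create_test_index` and `MAPPING_BODY` are hypothetical names chosen for illustration.

```python
# Mapping from the answer: "details" stores term vectors so that
# per-document term frequencies can be read back later.
MAPPING_BODY = {
    "mappings": {
        "_doc": {
            "properties": {
                "details": {
                    "type": "text",
                    "store": True,
                    "term_vector": "with_positions_offsets_payloads",
                },
                "name": {"type": "keyword"},
            }
        }
    }
}


def create_test_index(es, index="test"):
    """Create the index with the mapping above.

    `es` is assumed to be an elasticsearch-py 6.x client instance.
    """
    return es.indices.create(index=index, body=MAPPING_BODY)
```

The term_vector setting has to be in place before documents are indexed if the vectors are to be stored with them.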

Assume your query returns the following two docs:

{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}

From the above response you can collect the IDs of all documents that matched your query. Here they are "_id": "1" and "_id": "2".
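
In Python, pulling the IDs out of the search response could be sketched as follows. The helper name `matched_ids` is made up for this example, and the response dict is a trimmed copy of the one shown above.

```python
def matched_ids(search_response):
    """Return the _id of every hit in an Elasticsearch search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


# Trimmed-down version of the response shown above.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}
print(matched_ids(response))  # ['1', '2']
```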

Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:

POST test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}
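
From Python, the same request could be issued roughly like this. This is a sketch that assumes `es` is an elasticsearch-py 6.x client, whose `mtermvectors` method accepts `index`, `doc_type` and `body`; the two helper functions are hypothetical names, not part of the client.

```python
def build_mtermvectors_body(doc_ids, field):
    """Build the request body: one entry per document, limited to `field`."""
    return {"docs": [{"_id": doc_id, "fields": [field]} for doc_id in doc_ids]}


def fetch_term_vectors(es, index, doc_type, doc_ids, field):
    """Send the multi term vectors request through the given client.

    `es` is assumed to be an elasticsearch-py 6.x client instance.
    """
    body = build_mtermvectors_body(doc_ids, field)
    return es.mtermvectors(index=index, doc_type=doc_type, body=body)
```

Calling `build_mtermvectors_body(["1", "2"], "details")` produces the same body as the request above.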

The above returns the following result:

{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}

Note that I have used .... to stand for the data of the other terms in the field, since the term vectors API returns details for every term. You can extract the information for the required term from the above response; here I have shown it for cars, and the value you are interested in is term_freq.
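
Extracting term_freq for one term from such a response could be sketched in Python as below, assuming the response has already been parsed into a dict (which is what the Python client returns). `term_counts` is a hypothetical helper, and the sample data is a trimmed copy of the response above. The term has to be given as it appears in the term vector, i.e. after analysis, which with the default analyzer typically means lowercase.

```python
def term_counts(mtermvectors_response, field, term):
    """Map each document _id to the term_freq of `term` in `field`.

    Documents that do not contain the term get a count of 0.
    """
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed-down version of the _mtermvectors response shown above.
sample = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}
print(term_counts(sample, "details", "cars"))  # {'1': 2, '2': 1}
```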
