Abhay Bhargav
Abhay Bhargav

Reputation: 399

Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:

{
   "_id":"some_id_value",
   "_source":{
      "client":{
         "name":"x"
      },
      "project":{
         "name":"x November 2016"
      }
   }
}

I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.

The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.

This is the mapping for the document type

{
   "mappings":{
      "vma_docs":{
         "properties":{
            "client":{
               "properties":{
                  "contact":{
                     "type":"string"
                  },
                  "name":{
                     "type":"string"
                  }
               }
            },
            "project":{
               "properties":{
                  "end_date":{
                     "format":"yyyy-MM-dd",
                     "type":"date"
                  },
                  "project_type":{
                     "type":"string"
                  },
                  "name":{
                     "type":"string"
                  },
                  "project_manager":{
                     "index":"not_analyzed",
                     "type":"string"
                  },
                  "start_date":{
                     "format":"yyyy-MM-dd",
                     "type":"date"
                  }
               }
            }
         }
      }
   }
}

This is my search query

{
   "fields":[
      "client.name",
      "project.name"
   ],
   "query":{
      "bool":{
         "must":{
            "match":{
               "client.name":{
                  "operator":"and",
                  "query":"ABC systems"
               }
            }
         }
      }
   },
   "aggs":{
      "num_projects":{
         "cardinality":{
            "field":"project.name"
         }
      }
   },
   "size":5
}

These are the results I get (I have only posted 2 results for the sake of brevity). Please find that the num_projects aggregation returns 5, but must only return 4, which are the total number of projects.

{
   "hits":{
      "hits":[
         {
            "_score":5.8553367,
            "_type":"vma_docs",
            "_id":"AVTMIM9IBwwoAW3mzgKz",
            "fields":{
               "project.name":[
                  "ABC"
               ],
               "client.name":[
                  "ABC systems Pvt Ltd"
               ]
            },
            "_index":"vma"
         },
         {
            "_score":5.8553367,
            "_type":"vma_docs",
            "_id":"AVTMIM9YBwwoAW3mzgK2",
            "fields":{
               "project.name":[
                  "ABC"
               ],
               "client.name":[
                  "ABC systems Pvt Ltd"
               ]
            },
            "_index":"vma"
         }
      ],
      "total":18,
      "max_score":5.8553367
   },
   "_shards":{
      "successful":5,
      "failed":0,
      "total":5
   },
   "took":4,
   "aggregations":{
      "num_projects":{
         "value":5
      }
   },
   "timed_out":false
}

FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App

Upvotes: 1

Views: 334

Answers (1)

Andrei Stefan
Andrei Stefan

Reputation: 52368

You need the following mapping for your project.name field:

{
  "mappings": {
    "vma_docs": {
      "properties": {
        "client": {
          "properties": {
            "contact": {
              "type": "string"
            },
            "name": {
              "type": "string"
            }
          }
        },
        "project": {
          "properties": {
            "end_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            },
            "project_type": {
              "type": "string"
            },
            "name": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            },
            "project_manager": {
              "index": "not_analyzed",
              "type": "string"
            },
            "start_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            }
          }
        }
      }
    }
  }
}

It's basically a subfield called raw where the same value put in project.name is put in project.name.raw but without touching it (tokenizing or analyzing it). And then the query you need to use is:

{
  "fields": [
    "client.name",
    "project.name"
  ],
  "query": {
    "bool": {
      "must": {
        "match": {
          "client.name": {
            "operator": "and",
            "query": "ABC systems"
          }
        }
      }
    }
  },
  "aggs": {
    "num_projects": {
      "cardinality": {
        "field": "project.name.raw"
      }
    }
  },
  "size": 5
}

Upvotes: 2

Related Questions