ElasticSearch - terms aggregation split by whitespace

Question

I have a bunch of Elasticsearch documents that contain information about job ads. I'm trying to aggregate on the attributes.Title field to count the "experience" terms in the job postings, e.g. Junior, Senior, Lead, etc. Instead, I'm getting buckets that match the title as a whole rather than each word in the title field, e.g. "Junior Java Developer", "Senior .NET Analyst", etc.

How can I tell Elasticsearch to split the aggregation on each word in the title, as opposed to matching the value of the whole field?

I would later like to expand the query to also extract the "skill level" and "role", but it should also be fine if the buckets contain all the words in the field as long as they are split into separate buckets.

Current query:

GET /jobs/_search
{
  "query": {
    "simple_query_string" : {
        "query": "Java",
        "fields": ["attributes.Title"]
    }
  },
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "attributes.Title.keyword"
      }
    }
  }
}

Unwanted Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior Java Tester",
          "doc_count": 6
        },{
          "key": "Senior Java Lead",
          "doc_count": 6
        },{
          "key": "Intern Java Tester",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

Desired Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior",
          "doc_count": 12
        },{
          "key": "Senior",
          "doc_count": 8
        },{
          "key": "Tester",
          "doc_count": 5
        },{
          "key": "Intern",
          "doc_count": 5
        },{
          "key": "Analyst",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

yyssw · Accepted Answer

I'm inferring that your mapping type is keyword because you aggregated on a field called "attributes.Title.keyword". A keyword field is not tokenized, so at aggregation time the entire string is treated as a single key.

You want to add a sub-field of type "text" for the title field. I wouldn't call it Title.keyword but something like Title.analyzed. If you don't specify an analyzer, Elasticsearch will apply the standard analyzer, which should be enough to get you started; note that it lowercases tokens and strips most punctuation. You can use the whitespace analyzer instead if you only want your titles to be broken down by whitespace, with case and punctuation left as they are. Documents indexed before the mapping change need to be reindexed before the new sub-field is populated. You will get a lot of other words in your aggregation, but I'm assuming that you're looking for these shared experience modifier tokens, and based on frequency they will rise to the top.
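
To check which tokens an analyzer would produce for a title before touching any mapping, the _analyze API can be run against a sample string. A small sketch using a title from the question; swap "whitespace" for "standard" to compare the two:

```json
# Sample text taken from the question; no index or mapping is required.
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Senior .NET Analyst"
}
```

The whitespace analyzer should return the tokens Senior, .NET and Analyst unchanged, whereas standard returns senior, net and analyst. This matters because each distinct token becomes its own bucket key in a terms aggregation.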

If you're using 5.x, make sure to set "fielddata": true on the text sub-field, since text fields aren't available for aggregation by default.

Mapping:

"properties" : {
    "attributes" : {
        "properties" : {
            "Title" : {
                "type" : "text",
                "fields" : {
                    "keyword" : { "type" : "keyword" },
                    "analyzed" : { "type" : "text", "analyzer" : "whitespace", "fielddata" : true }
                }
            }
        }
    }
}
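
With a sub-field like that in place, the original query only needs its terms aggregation pointed at the analyzed sub-field. This is a minimal sketch, assuming the sub-field is addressable as attributes.Title.analyzed (adjust the path and casing to your actual mapping). The aggregation name title_words and the values listed in include are illustrative, not taken from your index:

```json
# Hypothetical names: "title_words" and the "include" values are examples only.
# Assumes the whitespace-analyzed sub-field is attributes.Title.analyzed.
GET /jobs/_search
{
  "query": {
    "simple_query_string": {
      "query": "Java",
      "fields": ["attributes.Title"]
    }
  },
  "size": 0,
  "aggs": {
    "title_words": {
      "terms": {
        "field": "attributes.Title.analyzed",
        "include": ["Junior", "Senior", "Lead", "Intern"]
      }
    }
  }
}
```

Dropping include returns a bucket for every token in the matching titles. Because the whitespace analyzer keeps case, "Junior" and "junior" would land in separate buckets. Each doc_count is the number of documents containing the token, not the number of occurrences.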
