o139

Reputation: 874

Elasticsearch - How to get popular words list of documents

I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.

For example, I have these documents:

1 - "aaa bbb ccc ddd eee fff"

2 - "bbb mmm aaa fff xxx"

3 - "hhh aaa fff"

So, I want to get the most popular words, ideally with counts: "aaa" - 3, "fff" - 3, "bbb" - 2, etc.

Is this possible with elasticsearch?

Upvotes: 19

Views: 17070

Answers (2)

Aron Fiechter

Reputation: 142

This question and the accepted answer are some years old, and there is now a better way.

The accepted answer does not take into account that the most common words are usually uninteresting stopwords such as "the", "a", "in", and "for".

This is typically the case for fields of type text rather than keyword.

For this reason, Elasticsearch has an aggregation designed for exactly this purpose: the significant text aggregation (significant_text).
From the docs:

  • It is specifically designed for use on type text fields
  • It does not require field data or doc-values
  • It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

It can, however, take longer than other kinds of queries, so it is best used after narrowing the data with a match query, or nested inside a parent aggregation of type sampler.

So, in your case you would send a query like this (leaving out the filtering/sampling):

{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
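As a rough sketch of the filtering/sampling step left out above, the request body could combine a match query with a sampler aggregation that wraps significant_text. The field name myfield comes from the example; the search text "aaa", the shard_size of 100, and the helper name build_keywords_request are assumptions made for illustration only.

```python
import json


def build_keywords_request(field, match_text, shard_size=100):
    """Build a search body: match query -> sampler -> significant_text.

    `match_text` and `shard_size` are illustrative values, not
    recommendations; tune them for your own data.
    """
    return {
        "size": 0,  # return only aggregation results, no hits
        "query": {"match": {field: match_text}},
        "aggs": {
            "sample": {
                # Limit the aggregation to the top-scoring docs per shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "keywords": {"significant_text": {"field": field}}
                },
            }
        },
    }


body = build_keywords_request("myfield", "aaa")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index.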

Upvotes: 11

Olly Cruickshank

Reputation: 6180

A simple terms aggregation will meet your needs (where mydata is the name of your field):

curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
  "query": {
    "match_all" : {}
  },
  "aggs" : {
    "mydata_agg" : {
      "terms" : {"field" : "mydata"}
    }
  }
}'

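This uses the older search_type=count parameter; on Elasticsearch 5.0 and later, set "size": 0 in the request body instead. The request will return: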

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "mydata_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "aaa",
        "doc_count" : 3
      }, {
        "key" : "fff",
        "doc_count" : 3
      }, {
        "key" : "bbb",
        "doc_count" : 2
      }, {
        "key" : "ccc",
        "doc_count" : 1
      }, {
        "key" : "ddd",
        "doc_count" : 1
      }, {
        "key" : "eee",
        "doc_count" : 1
      }, {
        "key" : "hhh",
        "doc_count" : 1
      }, {
        "key" : "mmm",
        "doc_count" : 1
      }, {
        "key" : "xxx",
        "doc_count" : 1
      } ]
    }
  }
}
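To turn that response into the "word - count" list from the question, the buckets can be flattened into pairs. The sketch below also recomputes the per-document word counts locally for the three sample documents as a sanity check. The helper names buckets_to_pairs and local_doc_counts are hypothetical, and the response dict is a trimmed copy of the output above.

```python
from collections import Counter

# The three sample documents from the question.
docs = ["aaa bbb ccc ddd eee fff", "bbb mmm aaa fff xxx", "hhh aaa fff"]

# Trimmed copy of the response above (only the first three buckets).
response = {
    "aggregations": {
        "mydata_agg": {
            "buckets": [
                {"key": "aaa", "doc_count": 3},
                {"key": "fff", "doc_count": 3},
                {"key": "bbb", "doc_count": 2},
            ]
        }
    }
}


def buckets_to_pairs(resp, agg_name):
    """Flatten a terms-aggregation result into (word, doc_count) pairs."""
    buckets = resp["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


def local_doc_counts(documents):
    """Count, for each word, how many documents contain it."""
    counts = Counter()
    for text in documents:
        counts.update(set(text.split()))  # set(): count each word once per doc
    # Highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


pairs = buckets_to_pairs(response, "mydata_agg")
for word, count in pairs:
    print(f'"{word}" - {count}')
```

Comparing pairs with the first entries of local_doc_counts(docs) is a quick way to confirm that the aggregation counts documents containing a word rather than total occurrences.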

Upvotes: 19
