user1005679

Reputation: 25

Return number of documents based on words in a field's string

How can I return the number of documents that have more than 2 elements in the "words" list and more than 3 words in "word_combination"? Is there a way to count the number of words in a string?

Example: return document if (the length of "words" > 2) AND ("words.word_combination" has more than 3 words)

I have many documents stored. One document's structure looks like this:

"_source" : {
"group_words" : [

  {
    "amount" : 1140,
    "words" : [
      {
        "relevance_score" : 56,
        "points" : 66461,
        "bits" : 100,
        "word_combination" : "cat dog"
      },
      {
        "relevance_score" : 84,
        "points" : 45202,
        "bits" : 990,
        "word_combination" : "cat dog elephant"
      },
      {
        "relevance_score" : 99,
        "points" : 30974,
        "bits" : 70,
        "word_combination" : "elephant cat mouse leopard"
      }
    ],
    "group" : "whatever"
  },
  {
    "amount" : 1320,
    "words" : [
      {
        "relevance_score" : 25,
        "points" : 53396,
        "bits" : 70,
        "word_combination" : "lion elephant"
      },
      {
        "relevance_score" : 66,
        "points" : 52166,
        "bits" : 20,
        "word_combination" : "lion mouse fish cat dog"
      },
      {
        "relevance_score" : 82,
        "points" : 49316,
        "bits" : 810,
        "word_combination" : "elephant cat mouse leopard dog lion"
      },
      {
        "relevance_score" : 87,
        "points" : 127705,
        "bits" : 290,
        "word_combination" : "elephant cat mouse leopard tiger lion"
      }
    ],
    "group" : "whatever"
  },
  {
    "amount" : 11260,
    "words" : [
      {
        "relevance_score" : 0,
        "points" : 37909,
        "bits" : 9000,
        "word_combination" : "elephant cat mouse leopard tiger lion monkey"
      },
      {
        "relevance_score" : 3,
        "points" : 35782,
        "bits" : 540,
        "word_combination" : "elephant"
      }
    ],
    "group" : "whatever"
  }      
]

}

Upvotes: 1

Views: 44

Answers (1)

Val

Reputation: 217274

Regarding the number of elements in the words array, my advice is to store that number in an additional field, words_count, at indexing time:

  {
    "amount" : 1140,
    "words_count": 3,                           <--- add this
    "words" : [
      {
        "relevance_score" : 56,
        "points" : 66461,
        "bits" : 100,
        "word_combination" : "cat dog"
      },
      {
        "relevance_score" : 84,
        "points" : 45202,
        "bits" : 990,
        "word_combination" : "cat dog elephant"
      },
      {
        "relevance_score" : 99,
        "points" : 30974,
        "bits" : 70,
        "word_combination" : "elephant cat mouse leopard"
      }
    ],
    "group" : "whatever"
  },
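One way to fill words_count before indexing is a small preprocessing step on the client side. The sketch below is only an illustration: the helper name add_words_count is hypothetical, and it assumes documents have the group_words / words layout shown in the question.

```python
import copy


def add_words_count(source):
    """Return a copy of the document with words_count set on every group.

    Hypothetical helper: assumes the layout from the question, i.e. a
    top-level "group_words" list whose entries each hold a "words" list.
    """
    doc = copy.deepcopy(source)  # leave the caller's document untouched
    for group in doc.get("group_words", []):
        # Number of elements in this group's "words" array
        group["words_count"] = len(group.get("words", []))
    return doc
```

The returned document can then be sent to the index as usual, so words_count is available as a regular numeric field at query time.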

Concerning the number of words (or tokens) in the word_combination field, there's a data type called token_count which exists exactly for this purpose. Simply define your mapping like this:

...
"word_combination": {
  "type": "text",
  "fields": {
    "count": {
      "type": "token_count",
      "analyzer": "standard"
    }
  }
}

Then in your query you can use word_combination.count, which contains the number of tokens (as produced by the specified analyzer) in the word_combination field.
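Putting both fields together, a request body for the count could look roughly like the sketch below. It is an assumption-laden illustration, not a tested query: build_count_query is a hypothetical helper, the full field paths (group_words.words_count and group_words.words.word_combination.count) are inferred from the document structure in the question, and group_words and words are assumed to be plain object arrays rather than nested fields.

```python
def build_count_query(min_words=2, min_tokens=3):
    """Build a request body combining both conditions.

    Hypothetical helper. Field paths are assumed from the question's
    document layout; adjust them to the actual mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # more than `min_words` elements in the words array
                    {"range": {"group_words.words_count": {"gt": min_words}}},
                    # more than `min_tokens` tokens in word_combination
                    {
                        "range": {
                            "group_words.words.word_combination.count": {
                                "gt": min_tokens
                            }
                        }
                    },
                ]
            }
        }
    }
```

Sending this body to the index's _count endpoint should return the number of matching documents. Note that with plain object arrays the two range clauses may be satisfied by different group_words entries of the same document; if both conditions must hold for the same entry, the fields would need a nested mapping and the clauses would have to be wrapped in a nested query.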

Upvotes: 1
