Reputation: 41
I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.
I tried the following query, but it gives a very large number, on the order of 10^20, which is not the correct answer for my dataset.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
"query": {
"match_all": {}
},
"aggs": {
"tk_count": {
"sum": {
"script": "_index[\"body\"].sumttf()"
}
}
},
"size": 0
}'
Any idea how to get the correct count of all tokens? (I do not need counts for each term, only the total count.)
Upvotes: 2
Views: 2331
Reputation: 21064
This worked for me; is it what you need?
Rather than computing the token count at query time (with the tk_count aggregation, as suggested in the other answer), my solution stores the token count at index time using the token_count datatype, so that the "name.stored_length" values are returned in query results.
token_count is used here as a "multi-field", so it works on one field at a time (i.e. the "name" field or the "body" field). I modified the example slightly to store "name.stored_length".
Note that my example does not count the cardinality of tokens (i.e. distinct values); it counts total tokens. "John John Doe" has 3 tokens in it, so "name.stored_length" === 3 (even though its distinct token count is only 2). Note also that I ask for specific "stored_fields": ["name.stored_length"].
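To make the total-versus-distinct difference concrete, here is a minimal Python sketch. The `tokenize` helper is hypothetical and only approximates the standard analyzer (lowercase, split on non-word characters); it is not what Elasticsearch runs internally.

```python
import re

def tokenize(text):
    # Assumption: a rough stand-in for the "standard" analyzer,
    # lowercasing the text and splitting on non-word characters.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("John John Doe")
total_tokens = len(tokens)          # what a token_count field would hold
distinct_tokens = len(set(tokens))  # what a distinct (cardinality-style) count would hold

print(total_tokens, distinct_tokens)
```

The first number is the total token count, the second the number of distinct tokens.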
Finally, you may need to re-index your documents (i.e. send a PUT again), or use any other technique that gets the values populated. In my case I had to PUT "John John Doe" again even though it was already in Elasticsearch (via POST/PUT); the tokens were not counted until I re-sent the document after adding the token_count sub-field to the mapping.
PUT test_token_count
{
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"stored_length": {
"type": "token_count",
"analyzer": "standard",
//------------------v
"store": true
}
}
}
}
}
}
}
PUT test_token_count/_doc/1
{
"name": "John John Doe"
}
Now we can query, or search for results, and configure the results to include the name.stored_length field (which is both a multi-field and a stored field!):
GET test_token_count/_search
{
//------------------v
"stored_fields" : ["name.stored_length"]
}
The search results should include the total token count as name.stored_length:
...
{
...
"hits": {
...
"hits": [
{
"_index": "test_token_count",
"_type": "_doc",
"_id": "1",
"_score": 1,
"fields": {
//------------------v
"name.stored_length": [
3
]
}
}
]
}
}
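Since the question asks for one total across all matching documents, the per-document values can be added up on the client. The following is only a sketch: the `total_token_count` function is a hypothetical helper, the response is a hand-written dict shaped like the result above, and the second hit is invented for illustration.

```python
def total_token_count(response, field="name.stored_length"):
    """Sum a stored token_count field over the hits of a search response."""
    total = 0
    for hit in response.get("hits", {}).get("hits", []):
        # Stored fields come back as lists, e.g. {"name.stored_length": [3]}.
        values = hit.get("fields", {}).get(field, [])
        total += sum(values)
    return total

# Hand-written sample; the hit with _id "2" is hypothetical.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "fields": {"name.stored_length": [3]}},
            {"_id": "2", "fields": {"name.stored_length": [2]}},
        ]
    }
}

print(total_token_count(sample_response))
```

This only covers the hits actually returned in the response, so a large result set would need paging through all results before the sum is complete.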
Upvotes: 3
Reputation: 1666
It seems you want to retrieve the cardinality of the tokens in the body field. In that case, you can simply use the cardinality aggregation, as below.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
"query": {
"match_all": {}
},
"aggs": {
"tk_count": {
"cardinality" : {
"field" : "body"
}
}
},
"size": 0
}'
For detailed information, see the official documentation on the cardinality aggregation.
Upvotes: 2