Reputation: 1815
I'm fairly new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:
Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.
I have been able to perform a simple query that returns all documents that contain the word "cars" in their content (using Python):
current_app.elasticsearch.search(index=index, doc_type=index,
body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}},
"from": 0, "size": 100})
The result looks something like this:
{'took': 2521,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index':
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571,
'_source': {'content': '....'}}]}}
The "_id"s refer to domains, so I basically get back:
But I now want to know how often the search term ("cars") occurs in each document, like:
I found several solutions for obtaining the number of documents that contain the search term, but none that show how to get the number of occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not realising that it is the solution to my problem.
Update:
As suggested by @Curious_MInd, I tried a terms aggregation:
current_app.elasticsearch.search(index=index, doc_type=index,
body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content"
}}}})
Result:
{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful':
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0,
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations':
{'cars_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}
I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: the results found by the terms aggregation are significantly worse than those from the multi_match query. Is there any way to combine the two?
Upvotes: 1
Views: 3530
Reputation: 7864
What you are trying to achieve can't be done in a single query. The first query filters the documents and returns the ids of those for which the term counts are required. Let's assume you have the following mapping:
{
"test": {
"mappings": {
"_doc": {
"properties": {
"details": {
"type": "text",
"store": true,
"term_vector": "with_positions_offsets_payloads"
},
"name": {
"type": "keyword"
}
}
}
}
}
}
Assume your query returns the following two docs:
{
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"details": "There is some content about cars here. Lots of cars!",
"name": "n1"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"details": "This page is all about cars",
"name": "n2"
}
}
]
}
}
From the above response you can get the ids of all documents that matched your query. For the response above, these are "_id": "1" and "_id": "2".
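Since the question uses Python, collecting the ids and turning them into a request body for the next step could be sketched as follows. The helper name build_mtermvectors_body is hypothetical, and the code assumes the search response has the shape shown above.

```python
def build_mtermvectors_body(search_response, field):
    """Build an _mtermvectors request body from a search response.

    Hypothetical helper: it requests term vectors for `field`
    for every hit in the response.
    """
    hits = search_response["hits"]["hits"]
    return {"docs": [{"_id": hit["_id"], "fields": [field]} for hit in hits]}


# Trimmed copy of the sample search response above (no _source needed here).
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

body = build_mtermvectors_body(sample_response, "details")
print(body)
```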
Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:
POST test/_doc/_mtermvectors
{
"docs": [
{
"_id": "1",
"fields": [
"details"
]
},
{
"_id": "2",
"fields": [
"details"
]
}
]
}
The above returns the following result:
{
"docs": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 8,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 2,
"tokens": [
{
"position": 5,
"start_offset": 28,
"end_offset": 32
},
{
"position": 9,
"start_offset": 47,
"end_offset": 51
}
]
},
....
}
}
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 1,
"tokens": [
{
"position": 5,
"start_offset": 23,
"end_offset": 27
}
]
},
....
}
}
}
}
]
}
Note that I have used .... to denote the data for the other terms in the field, since the term vectors API returns the details for every term.
You can extract the information about the required term from the above response. Here I have shown it for cars, and the field you are interested in is term_freq.
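Pulling term_freq out of that response takes only a few lines of Python. This is a minimal sketch rather than a definitive implementation: the helper name term_frequencies is hypothetical, and it assumes the response shape shown above. The keys under terms are the analyzed tokens, so with a lowercasing analyzer the term has to be passed in lowercase.

```python
def term_frequencies(mtv_response, field, term):
    """Map each document id to the term_freq of `term` in `field`.

    Hypothetical helper. Documents that have no term vectors for the
    field, or that do not contain the term, get a count of 0.
    """
    counts = {}
    for doc in mtv_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed copy of the sample response above (only the "cars" entries kept).
sample = {
    "docs": [
        {"_id": "1", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_frequencies(sample, "details", "cars"))  # {'1': 2, '2': 1}
```

With the Python client from the question, the response would presumably come from a call such as current_app.elasticsearch.mtermvectors(index="test", doc_type="_doc", body=...); treat that call and its arguments as an assumption to check against your client version.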
Upvotes: 3
Reputation: 38502
I think you need a terms aggregation here, like below:
GET /_search
{
"aggs" : {
"cars_count" : {
"terms" : { "field" : "Content" }
}
}
}
Upvotes: 1