Reputation: 692
I am trying to crack the Elasticsearch query language, and so far I'm not doing very well.
I've got the following mapping for my documents.
{
  "mappings": {
    "jsondoc": {
      "properties": {
        "header": {
          "type": "nested",
          "properties": {
            "plainText": { "type": "string" },
            "title": { "type": "string" },
            "year": { "type": "string" },
            "pages": { "type": "string" }
          }
        },
        "sentences": {
          "type": "nested",
          "properties": {
            "id": { "type": "integer" },
            "text": { "type": "string" },
            "tokens": { "type": "nested" },
            "rhetoricalClass": { "type": "string" },
            "babelSynsetsOcc": {
              "type": "nested",
              "properties": {
                "id": { "type": "integer" },
                "text": { "type": "string" },
                "synsetID": { "type": "string" }
              }
            }
          }
        }
      }
    }
  }
}
It essentially mirrors a JSON file describing a PDF document.
I have been trying to write queries with aggregations, and so far it is going well. I can group by (aggregate on) rhetoricalClass and get the total number of occurrences of babelSynsetsOcc.synsetID. I can even run the same query while grouping the whole result by header.year.
Right now, though, I am struggling to filter the documents down to those that contain a term before running the same query.
So, how can I write a query that groups by rhetoricalClass but only takes into account documents whose header.plainText field contains any of ["Computational", "Compositional", "Semantics"]? I mean contains, not equals.
If I were to make a rough translation to SQL, it would be something similar to:
SELECT count(sentences.babelSynsetsOcc.synsetID)
FROM jsondoc
WHERE header.plainText LIKE '%Computational%' OR header.plainText LIKE '%Compositional%' OR header.plainText LIKE '%Semantics%'
GROUP BY sentences.rhetoricalClass
Upvotes: 0
Views: 231
Reputation: 22332
WHERE clauses are just standard structured queries, so they translate to queries in Elasticsearch.
GROUP BY and HAVING loosely translate to aggregations in Elasticsearch's DSL. Functions like count, min, max, and sum operate on the groups produced by GROUP BY, so they are aggregations as well.
The fact that you're using nested objects may be necessary, but it adds an extra layer to each part of the request that touches them. If those nested objects are not arrays, then do not use nested; use object in that case.
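To illustrate why the distinction only matters for arrays, here is a minimal Python sketch that imitates how an array of plain object values ends up as flat per-field value lists, which is the situation nested is meant to avoid. The helpers `flatten_objects`, `flat_match`, and `nested_match` and the sample sentences are hypothetical; they only mimic the behaviour and are not part of Elasticsearch or any client library.

```python
from collections import defaultdict

def flatten_objects(path, objects):
    """Imitate indexing an array of `object` values: one flat value list per field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{path}.{key}"].append(value)
    return dict(flat)

def flat_match(flat_doc, conditions):
    """Flattened form: each condition only needs its value somewhere in that field's list."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())

def nested_match(objects, path, conditions):
    """Nested form: all conditions must hold inside the same object."""
    prefix = len(path) + 1  # strip "sentences." from the field name
    return any(
        all(obj.get(field[prefix:]) == value for field, value in conditions.items())
        for obj in objects
    )

# Hypothetical sample data: two sentences with different rhetorical classes.
sentences = [
    {"id": 1, "rhetoricalClass": "background"},
    {"id": 2, "rhetoricalClass": "outcome"},
]
flat = flatten_objects("sentences", sentences)

# No single sentence has id 1 AND class "outcome".
conditions = {"sentences.id": 1, "sentences.rhetoricalClass": "outcome"}
cross_match_flat = flat_match(flat, conditions)
cross_match_nested = nested_match(sentences, "sentences", conditions)
```

In the flattened form the two conditions are satisfied by different sentences, so the document matches anyway; with a single (non-array) object that cannot happen, which is why object is enough there.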
I would probably look at translating your query to:
{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "header.plainText": "Computational"
              }
            },
            {
              "match": {
                "header.plainText": "Compositional"
              }
            },
            {
              "match": {
                "header.plainText": "Semantics"
              }
            }
          ]
        }
      }
    }
  }
}
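If the list of terms changes often, the should clauses can be generated rather than written by hand. This is a small sketch in plain Python that only builds the request body as a dict; `build_contains_any_query` is a hypothetical helper name, and sending the body to a cluster is left out on purpose.

```python
import json

def build_contains_any_query(path, field, terms):
    """Build a nested bool/should query with one match clause per term."""
    full_field = f"{path}.{field}"
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "bool": {
                        # Any one matching clause is enough for the document to match.
                        "should": [{"match": {full_field: term}} for term in terms]
                    }
                },
            }
        }
    }

body = build_contains_any_query(
    "header", "plainText", ["Computational", "Compositional", "Semantics"]
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the query above and can be pasted into whatever tool you use to send requests.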
Alternatively, it could be rewritten like this, which makes the intent a little less obvious:
{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  }
}
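The reason the single match works is that the query text is analyzed into separate terms and, with the default or operator, a document matches if it contains any of them. Below is a deliberately crude Python imitation of that behaviour, assuming header.plainText is analyzed with something like the standard analyzer (lowercasing and splitting into words). The `analyze` and `match_any` functions are hypothetical stand-ins, not Elasticsearch APIs, and the real analyzer handles many more cases.

```python
import re

def analyze(text):
    """Very rough stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())

def match_any(field_text, query_text):
    """Imitate a match query with the OR operator: any shared token is a hit."""
    return bool(set(analyze(field_text)) & set(analyze(query_text)))

query = "Computational Compositional Semantics"

hit = match_any("A study of compositional models", query)
miss = match_any("Syntax and parsing", query)
# Matching is per token, so "semantic" does not match the term "semantics".
partial_word = match_any("Semantic parsing", query)
```

Note that this is word-level matching, so it is closer to "contains the word" than to SQL's LIKE '%...%', which also matches inside longer words.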
The aggregation would then be:
{
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now, if you combine them and throw away hits (since you're just looking for the aggregated result), then it looks like this:
{
  "size": 0,
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  },
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
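Once the request runs, the per-class counts have to be dug out of the aggregation part of the response. Here is a minimal sketch of that step in Python, assuming the sub-aggregations are nested in the order nested_sentences, group_by_rhetorical_class, nested_babel, count_synset_id, and that the response follows the usual buckets layout of a terms aggregation. `counts_by_class` is a hypothetical helper and the numbers in `sample_response` are made up purely for illustration.

```python
def counts_by_class(response):
    """Map each rhetoricalClass bucket key to its synsetID value count."""
    buckets = (
        response["aggregations"]["nested_sentences"]
        ["group_by_rhetorical_class"]["buckets"]
    )
    return {
        bucket["key"]: bucket["nested_babel"]["count_synset_id"]["value"]
        for bucket in buckets
    }

# Hypothetical response body; the keys and numbers are invented for this example.
sample_response = {
    "aggregations": {
        "nested_sentences": {
            "doc_count": 5,
            "group_by_rhetorical_class": {
                "buckets": [
                    {
                        "key": "background",
                        "doc_count": 3,
                        "nested_babel": {
                            "doc_count": 7,
                            "count_synset_id": {"value": 7},
                        },
                    },
                    {
                        "key": "outcome",
                        "doc_count": 2,
                        "nested_babel": {
                            "doc_count": 4,
                            "count_synset_id": {"value": 4},
                        },
                    },
                ]
            },
        }
    }
}

counts = counts_by_class(sample_response)
```

The resulting dict is the rough equivalent of the rows the SQL GROUP BY in the question would return.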
Upvotes: 1