fasaas

Reputation: 692

ElasticSearch query on tags

I am trying to crack the Elasticsearch query language, and so far I'm not doing very well.

I've got the following mapping for my documents.

{
    "mappings": {
        "jsondoc": {
            "properties": {
                "header" : {
                    "type" : "nested",
                    "properties" : {
                        "plainText" : { "type" : "string" },
                        "title" : { "type" : "string" },
                        "year" : { "type" : "string" },
                        "pages" : { "type" : "string" }
                    }
                },
                "sentences": {
                    "type": "nested",
                    "properties": {
                        "id": { "type": "integer" },
                        "text": { "type": "string" },
                        "tokens": { "type": "nested" },
                        "rhetoricalClass": { "type": "string" },
                        "babelSynsetsOcc": {
                            "type": "nested",
                            "properties" : {
                                "id" : { "type" : "integer" },
                                "text" : { "type" : "string" },
                                "synsetID" : { "type" : "string" }
                            }
                        }
                    }
                }
            }
        }
    }
}

It essentially mirrors a JSON file describing a PDF document.

I have been experimenting with aggregations, and so far it is going well. I can group by (aggregate on) rhetoricalClass and get the total number of occurrences of babelSynsetsOcc.synsetID. I can even run the same query with the whole result grouped by header.year.

Right now, though, I am struggling to filter the documents down to those that contain a term and then run the same query.

So, how can I write a query that groups by rhetoricalClass but only takes into account documents whose header.plainText field contains any of ["Computational", "Compositional", "Semantics"]? I mean contains, not equals.

If I were to make a rough translation to SQL it would be something similar to

SELECT count(sentences.babelSynsetsOcc.synsetID)
FROM jsondoc
WHERE header.plainText like '%Computational%' OR header.plainText like '%Compositional%' OR header.plainText like '%Semantics%'
GROUP BY sentences.rhetoricalClass

Upvotes: 0

Views: 231

Answers (1)

pickypg

Reputation: 22332

WHERE clauses are just standard structured queries, so they translate to queries in Elasticsearch.

GROUP BY and HAVING loosely translate to aggregations in Elasticsearch's DSL. Functions like count, min, max, and sum operate on the groups produced by GROUP BY, so they are aggregations as well.

The fact that you're using nested objects may be necessary, but it adds an extra layer to each part that touches them. If those nested objects are not arrays, then do not use nested; use object in that case.
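
As a rough sketch of that last point, assuming each document has exactly one header (an assumption; the question does not say so), the mapping could declare header as a plain object. Only the header part is shown here, and the remaining fields would stay as in the question:

```json
{
  "mappings": {
    "jsondoc": {
      "properties": {
        "header": {
          "type": "object",
          "properties": {
            "plainText": { "type": "string" },
            "title": { "type": "string" },
            "year": { "type": "string" },
            "pages": { "type": "string" }
          }
        }
      }
    }
  }
}
```

With that mapping, a query on header.plainText would no longer need to be wrapped in a nested query. The rest of this answer keeps the nested mapping from the question.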

I would probably look at translating your query to:

{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "header.plainText" : "Computational"
              }
            },
            {
              "match": {
                "header.plainText" : "Compositional"
              }
            },
            {
              "match": {
                "header.plainText" : "Semantics"
              }
            }
          ]
        }
      }
    }
  }
}

Alternatively, it could be rewritten as follows, although the intent is a little less obvious:

{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  }
}
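
The single match works because a match query analyzes its input into separate terms and, by default, combines them with OR. Here is a sketch that spells this out with the match query's operator parameter (equivalent to the default, shown only to make the intent explicit):

```json
{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": {
            "query": "Computational Compositional Semantics",
            "operator": "or"
          }
        }
      }
    }
  }
}
```

Note that match compares whole analyzed tokens, so it finds plainText values that contain any of these words, but it is not a substring match in the way SQL's LIKE '%...%' is.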

The aggregation would then be:

{
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now, if you combine them and throw away the hits by setting size to 0 (since you're only looking for the aggregated result), it looks like this:

{
  "size": 0,
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  },
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Upvotes: 1
