Jesus Gomez

Reputation: 1520

Mapping with thousands of fields design options

I am trying to design a mapping for a general-purpose list of "terms", each with a label and a value, like this:

terms = [
  { label: "Start Date", value: "2017/12/11" }, // this is a date
  { label: "End Date", value: "2027/12/11" },
  { label: "Owner", value: "Monsters INC." },   // this is text
  { label: "Fees", value: "1000$" }             // this is a numeric field
]

While all documents will share several common fields, I have several different document templates, and users will be able to add custom terms with different data types to the list.

I need to query documents using boolean logic, for example: get the documents where the start date is in the last year, the fees are less than 1000$, and the owner is "Monsters INC."

The list of terms is quite big (thousands), and more can be added by users or by the development team.

I have explored two solutions to this problem:

Storing as a nested object:

The mapping looks like this:

"terms": {
    "type": "nested",
    "properties": {
        "label": { "type": "string" },
        "value": { "type": "string" },
        "source": { "type": "string" },
        "page": { "type": "string" }
    }
}

Pros: No need to remake the index when new terms are added, smaller mapping

Cons:

Queries are harder, since each one has to check that a label and its value belong to the same term.

Since all values are strings, there is no way to use lt or gt.

It might be possible to implement lt and gt using casting, but it seems slow (which defeats the purpose of ES).
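To make the first con concrete, here is a minimal sketch of how the query body could be built for the nested mapping above. It assumes one `nested` clause per term, with `match_phrase` on both `terms.label` and `terms.value`; the helper names `nested_term_clause` and `nested_terms_query` are made up for illustration, and only text matching is covered, not lt/gt.

```python
def nested_term_clause(label, value):
    """Build one nested clause: label and value must match inside the same term."""
    return {
        "nested": {
            "path": "terms",
            "query": {
                "bool": {
                    "must": [
                        {"match_phrase": {"terms.label": label}},
                        {"match_phrase": {"terms.value": value}},
                    ]
                }
            },
        }
    }


def nested_terms_query(conditions):
    """AND together one nested clause per (label, value) pair."""
    clauses = [nested_term_clause(label, value) for label, value in conditions]
    return {"query": {"bool": {"must": clauses}}}


# Hypothetical usage: documents whose owner is "Monsters INC."
query = nested_terms_query([("Owner", "Monsters INC.")])
```

Every extra condition adds another `nested` clause, which is where the extra query complexity comes from.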

Creating a big mapping:

Just create a big object with every single possible term:

{
   "Start Date": { "type": "date" },
   "End Date": { "type": "date" },
   "Owner": { "type": "text" },
   "Fees": { "type": "integer" },
    ... add as many terms as needed
}

Pros: Queries become straightforward, lt and gt work, and each field can get whatever optimization it needs (exact fields, keyword fields, etc.).

Cons: Big, sparse mappings are not recommended by ES, since every document shares the same underlying data structure.

More work keeping the term list updated

Terms with the same name might clash if they have different data types
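For comparison, with one typed field per term the example query (start date last year, fees under 1000, a given owner) could be sketched as a plain bool query. This assumes the field names from the big mapping above, fees stored as a bare integer, and date math (`now-1y/y`, `now/y`) to express "last year"; the helper name `typed_fields_query` is made up for illustration.

```python
def typed_fields_query(owner, max_fees):
    """Bool query over typed top-level fields (sketch, not tested against a cluster)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"Owner": owner}},
                ],
                "filter": [
                    # From the start of last year up to (not including) the start of this year.
                    {"range": {"Start Date": {"gte": "now-1y/y", "lt": "now/y"}}},
                    {"range": {"Fees": {"lt": max_fees}}},
                ],
            }
        }
    }


query = typed_fields_query("Monsters INC.", 1000)
```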

Is there any solution to this pattern offered by ES? Any help appreciated.

We are currently using ES 5.5. There are currently 1400 terms in the term dictionary.

Upvotes: 1

Views: 102

Answers (1)

moxn

Reputation: 1800

Assuming you know the type of each term both when you index and when you search, you could encode the type of the value in the field name and use dynamic templates with pattern matching. You would just have to build a projection from a label ("Start date") to a property name with the type encoded ("start_date_date"), then write the label as a string and the dynamically typed value into that property. That way, everything matching a pattern (*_date) gets mapped to a specific type.

Edit: Dynamic templates are part of Elasticsearch. They let you define a mapping template that is applied when, for example, the field name matches a certain pattern.
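The projection from label to typed property name could look roughly like this. It is only a sketch: the helpers `field_name` and `to_document` are made up for illustration, and it assumes every term carries a `type` key ("date", "number", ...) alongside its label and value, which the original term list does not have.

```python
import re


def field_name(label, value_type):
    """Project a label such as "Start Date" to a typed name such as "start_date_date"."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"{slug}_{value_type}"


def to_document(terms):
    """Flatten a list of terms into one object keyed by typed field names."""
    return {field_name(t["label"], t["type"]): t["value"] for t in terms}


# Assumption: values are sent as strings so the dynamic templates decide the type.
doc = to_document([
    {"label": "Start Date", "type": "date", "value": "2017-12-11"},
    {"label": "Fees", "type": "number", "value": "1000"},
])
```

The same `field_name` call would be used at search time to find the field to query, which is why the type has to be known on both sides.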

{"terms": {
  "dynamic_templates": [
    {
      "date_term": {
        "match_mapping_type": "string",
        "match": "*_date",
        "mapping": {
          "type": "date"
        }
      }
    },
    {
      "numeric_term": {
        "match_mapping_type": "string",
        "match": "*_number",
        "mapping": {
          "type": "long"
        }
      }
    }
  ]
}}

This snippet would use the date type for start_date_date and long for start_date_number, provided you supply the value as a string in both cases (that is what match_mapping_type is for). If you supply the value as a JSON number instead, Elasticsearch's own dynamic mapping (if enabled) would already take care of mapping it to a numeric type.
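To see which type a given field name would end up with, the pattern matching can be imitated locally. This is only a simulation, not Elasticsearch's implementation: it assumes the two templates above, that templates are tried in order with the first match winning, and it ignores match_mapping_type. The names `TEMPLATES` and `resolve_type` are made up for illustration.

```python
from fnmatch import fnmatchcase

# (template name, "match" pattern, mapped type), in the order they are declared.
TEMPLATES = [
    ("date_term", "*_date", "date"),
    ("numeric_term", "*_number", "long"),
]


def resolve_type(field, templates=TEMPLATES):
    """Return the type of the first template whose pattern matches, else None."""
    for _name, pattern, es_type in templates:
        if fnmatchcase(field, pattern):
            return es_type
    return None  # no template matched; default dynamic mapping would apply


date_type = resolve_type("start_date_date")
number_type = resolve_type("start_date_number")
```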

Upvotes: 1
