Internals of array of strings vs. concatenated string in ElasticSearch

Question

I am trying to better understand internals of ElasticSearch, so I would like to know if there are any differences in how ElasticSearch internally computes term statistics for the following two cases.

The first case is when I have documents like:

{
  "foo": [
    {
      "bar": "long string"
    },
    {
      "bar": "another long string"
    }
  ]
}

Or a document like:

{
  "foobar": "long string another long string"
}

My understanding is that the first document gets flattened to:

{
  "foo.bar": ["long string", "another long string"]
}

So the question really is: are the second and third documents indexed the same way? Are term statistics computed the same way?

Val · Accepted Answer

Interesting question! If you index the first and the second document and then look at the term vectors of the text field in each (foo.bar and foobar, respectively), you'll notice that the term frequencies and offsets are exactly the same; the positions, however, differ.
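One way to run that comparison is the term vectors API. The sketch below only builds the request body in Python; the index name test and document ID 1 in the comment are placeholders, not values from the question.

```python
import json

# Body for: GET /test/_termvectors/1  (index name and ID are placeholders)
term_vectors_request = {
    "fields": ["foo.bar"],    # use "foobar" for the second document
    "positions": True,        # token positions, where the two documents differ
    "offsets": True,          # character offsets
    "term_statistics": True,  # index-wide statistics for each term
}

print(json.dumps(term_vectors_request, indent=2))
```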

This is due to the position_increment_gap setting, whose default value is 100. This fake gap is inserted between the values of a multi-valued field to prevent phrase queries from matching across values.

So in the first document, the foo.bar field has multiple values, as you rightly noticed, which is why its term positions differ from those of the second document, where there's only a single string.

["long string", "another long string"]

That means that if you use a match_phrase query for "string another", it won't match the first document, only the second one.
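To make the position arithmetic concrete, here is a minimal Python sketch that mimics how positions are assigned to a multi-valued field. It is not Elasticsearch code: it assumes plain whitespace tokenization, and the helper names assign_positions and phrase_matches are made up for illustration. The only behaviour it borrows is that the gap is added between consecutive values.

```python
def assign_positions(values, gap=100):
    """Return (term, position) pairs for a multi-valued text field.

    Assumptions: whitespace tokenization, one position per token, and
    `gap` extra positions inserted between consecutive values.
    """
    postings = []
    position = -1
    for index, value in enumerate(values):
        if index > 0:
            position += gap  # the fake gap between array values
        for term in value.split():
            position += 1
            postings.append((term, position))
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions (slop 0)."""
    terms = phrase.split()
    by_term = {}
    for term, pos in postings:
        by_term.setdefault(term, set()).add(pos)
    return any(
        all(start + i in by_term.get(t, set()) for i, t in enumerate(terms))
        for start in by_term.get(terms[0], set())
    )


array_doc = assign_positions(["long string", "another long string"])
single_doc = assign_positions(["long string another long string"])

print(array_doc)   # positions 0, 1, 102, 103, 104
print(single_doc)  # positions 0, 1, 2, 3, 4
print(phrase_matches(array_doc, "string another"))   # False
print(phrase_matches(single_doc, "string another"))  # True
```

Calling assign_positions with gap=0 on the array yields the same positions as the single concatenated string.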

You can still change the value of position_increment_gap to 0 in the mapping of the foo.bar field, in which case both documents would be indexed exactly the same way.
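As a rough illustration, a mapping with the gap set to 0 on foo.bar could look like the following. It is written as a Python dict so it can be serialized and sent as the body of an index-creation request; the typeless layout assumes Elasticsearch 7 or later, and the plain text field type is an assumption about the original mapping.

```python
import json

# Hypothetical index body: only position_increment_gap comes from the answer.
mapping = {
    "mappings": {
        "properties": {
            "foo": {
                "properties": {
                    "bar": {
                        "type": "text",
                        "position_increment_gap": 0,  # no gap between array values
                    }
                }
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
```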
