Reputation: 7030
I am trying to better understand the internals of ElasticSearch, so I would like to know whether there are any differences in how ElasticSearch internally computes term statistics in the following two cases.
The first case is when I have documents like:
{
  "foo": [
    {
      "bar": "long string"
    },
    {
      "bar": "another long string"
    }
  ]
}
Or a document like:
{
  "foobar": "long string another long string"
}
My understanding is that the first document gets flattened to:
{
  "foo.bar": ["long string", "another long string"]
}
So the question really is: are the second and third documents indexed the same way? Are term statistics computed the same way?
Upvotes: 3
Views: 368
Reputation: 217314
Interesting question! If you index the first and the second document and then look at the term vectors for the foo.bar and foobar fields, you'll notice that the frequencies and offsets are exactly the same, but the positions differ.
The reason for this is the position_increment_gap setting, whose default value is 100. This artificial gap is inserted between the values of a multi-valued field to prevent phrase queries from matching across values.
So in the first document, the foo.bar field has multiple values, as you rightly noticed:
["long string", "another long string"]
This is why its term positions differ from those of the second document, where there is only a single string.
That means that if you use a match_phrase query to match "string another", it won't match the first document, only the second one.
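To make the effect of the gap concrete, here is a minimal Python sketch that imitates how positions could be assigned. It is not Elasticsearch code: the helper names `positions` and `phrase_matches` are made up for illustration, tokenization is plain whitespace splitting, and the model (each new value starts `gap` positions further along, then tokens increment by 1) is an assumption, so the exact numbers should be checked against the term vectors of a real index.

```python
def positions(values, gap=100):
    """Assign a position to every whitespace-separated token across all values.

    Simplified model (an assumption, not the real analyzer): the counter
    starts at -1, each token adds 1, and `gap` is added between values.
    """
    out = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # artificial gap between two values of the same field
        for token in value.split():
            pos += 1
            out.append((token, pos))
    return out


def phrase_matches(indexed, phrase):
    """Return True if the phrase tokens sit at consecutive positions (slop 0)."""
    terms = phrase.split()
    by_pos = {pos: tok for tok, pos in indexed}
    return any(
        all(by_pos.get(start + k) == term for k, term in enumerate(terms))
        for start in by_pos
    )


multi = positions(["long string", "another long string"])   # like foo.bar
single = positions(["long string another long string"])     # like foobar

print(multi)
print(single)
print(phrase_matches(multi, "string another"))
print(phrase_matches(single, "string another"))
```

In this model the multi-valued field leaves a large hole between "string" and "another", so the phrase cannot match at slop 0, while the single string keeps the two tokens adjacent.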
You can still change the value of position_increment_gap in the mapping of the first document's foo.bar field and set it to 0, in which case both documents would be indexed in exactly the same way.
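As a rough illustration, the mapping body for that change could look like the one built below. This is only a sketch: it assumes a recent Elasticsearch version with typeless mappings and a text field, and the body would be sent when creating an index of your choice (no index name or client call is shown here, since those depend on your setup).

```python
import json

# Mapping sketch: foo is an object field whose bar sub-field is a text field
# with the gap between multiple values disabled.
mapping = {
    "mappings": {
        "properties": {
            "foo": {
                "properties": {
                    "bar": {
                        "type": "text",
                        "position_increment_gap": 0,
                    }
                }
            }
        }
    }
}

# Print the JSON body that would go into the index-creation request.
print(json.dumps(mapping, indent=2))
```

Note that with a gap of 0 a match_phrase query for "string another" would also match across the two values of the first document, which is exactly the behavior the default gap is meant to prevent.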
Upvotes: 3