Jelmer Visser
Jelmer Visser

Reputation: 163

MongoDB match on document and subdocuments, what to use as indexes?

I have a lot of documents looking like this:

[{
        "title": "Luxe [daagse] [verzorging] @ Egypte! Incl. vluchten, transfers & 4* ho",
        "price": 433,
        "automatic": false,
        "destination": "5d26fc92f72acc7a0b19f2c4",
        "date": "2020-01-19T00:00:00.000+00:00",
        "days": 8,
        "arrival_airport": "5d1f5b407ec7385fa2963623",
        "departure_airport": "5d1f5adb7ec7385fa2963307",
        "board_type": "5d08e1dfff6c4f13f6db1e6c"
    },
    {
        "title": "Luxe [daagse] [verzorging] @ Egypte! Incl. vluchten, transfers & 4* ho",
        "automatic": true,
        "destination": "5d26fc92f72acc7a0b19f2c4",
        "prices": [{
                "price": 433,
                "date_from": "2020-01-19T00:00:00.000+00:00",
                "date_to": "2020-01-28T00:00:00.000+00:00",
                "day_count": 8,
                "arrival_airport": "5d1f5b407ec7385fa2963623",
                "departure_airport": "5d1f5adb7ec7385fa2963307",
                "board_type": "5d08e1dfff6c4f13f6db1e6c"
            },
            {
                "price": 899,
                "date_from": "2020-04-19T00:00:00.000+00:00",
                "date_to": "2020-04-28T00:00:00.000+00:00",
                "day_count": 19,
                "arrival_airport": "5d1f5b407ec7385fa2963623",
                "departure_airport": "5d1f5adb7ec7385fa2963307",
                "board_type": "5d08e1dfff6c4f13f6db1e6c"
            }
        ]
    }
]

As you can see, automatic deals have multiple prices (can be a lot, between 1000 and 4000) and does not have the original fields available.

Now I need to search in the original document as well in the subdocuments to look for a match.

This is the aggregation I use to search through the documents:

[{
    "$match": {
        "destination": {
            "$in": ["5d26fc9af72acc7a0b19f313"]
        }
    }
}, {
    "$match": {
        "$or": [{
            "prices": {
                "$elemMatch": {
                    "price": {
                        "$lte": 1500,
                        "$gte": 400
                    },
                    "date_to": {
                        "$lte": "2020-04-30T22:00:00.000Z"
                    },
                    "date_from": {
                        "$gte": "2020-03-31T22:00:00.000Z"
                    },
                    "board_type": {
                        "$in": ["5d08e1bfff6c4f13f6db1e68"]
                    }
                }
            }
        }, {
            "price": {
                "$lte": 1500,
                "$gte": 400
            },
            "date": {
                "$lte": "2020-04-30T22:00:00.000Z",
                "$gte": "2020-03-31T22:00:00.000Z"
            },
            "board_type": {
                "$in": ["5d08e1bfff6c4f13f6db1e68"]
            }
        }]
    }
}, {
    "$limit": 20
}]

I would like to speed things up, because it can be quite slow. I was wondering, what is the best index strategy for this aggregate, what fields do I use? Is this the best way of doing it or is there a better way?

Upvotes: 2

Views: 64

Answers (2)

prasad_
prasad_

Reputation: 14317

A compound index including the match filter fields is required to make the aggregation run fast. In aggregation queries, having the $match stage early in the pipeline (preferably, first stage) utilizes indexes, if any are defined on the filter fields. In the posted query it is so, and defining the indexes is all needed for a fast query. But, index on what fields?

The index is going to be compound index; i.e., index on multiple fields of the query criteria. The index prefix starts with the destination field. The remaining index fields are to be determined. What are the remaining fields?

Most of these fields are in the prices array's sub-document fields - price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?

Defining indexes on array elements (or fields of sub-documents in an array) creates lots of index keys. This means lots of storage and for using the index the memory (or RAM). This is an important consideration. Indexes on array elements are called as multikey indexes. For an index to be properly utilized, the collection's documents and the index being used by the query (together called as working set) must fit into the RAM.

Another aspect you need to consider is the query selectivity. How many documents gets selected using a filter which uses an index field, is a factor. It is imperative that the filter field with must select a small set of the input documents to be effective. See Create Queries that Ensure Selectivity.

It is difficult to determine what other fields need to be considered (sure some of the fields of the prices) based on the above two factors. So, the index is going to be something like this:

{ destination: 1, fld1: 1, fld2: 1, ... }

The fld1, fld2, ..., are going to be the prices array sub-document fields and / or the date field. I think only one set of date fields can be used with the index. An example index can be one of these:

{ destination: 1, date: 1,  "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}

Note the index keys order and the necessity of the price, date_from, date_to and board_type is to be determined based upon the two main factors - requirement of the working set and the query selectivity - this is important.

NOTES: On a small sample data set with similar structure showed usage of the compound index with the primary destination field and two fields from the prices (one with equality condition and one with range condition). The query plan using the explain showed an IXSCAN (index scan) on the compound index, and using an index will sure improve the query performance.

Upvotes: 2

Tom Slabbaert
Tom Slabbaert

Reputation: 22296

From Mongo's $or docs:

When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.

So with that in mind in order to avoid a collection scan in this pipeline you have to create a compound index containing both price and prices fields.

Remember that order matters in compound indexes so the order of the field should vary depending on your possible usage of it.

It seems to me that the index you want to create looks something like:

{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}

Upvotes: 3

Related Questions