Incredibly slow query performance with $lookup and "sub" aggregation pipeline

Let's say I have two collections, tasks and customers.

Customers have a 1:n relation with tasks via a "customerId" field in customers.

I now have a view where I need to display tasks with customer names. AND I also need to be able to filter and sort for customer names. Which means I can't do the $limit or $match stage before $lookup in the following query.

So here is my example query:

db.task.aggregate([
    {
        "$match": {
            "_deleted": false
        }
    },
    "$lookup": {
        "from": "customer",
        "let": {
            "foreignId": "$customerId"
        },
        "pipeline": [
            {
                "$match": {
                    "$expr": {
                        "$and": [
                            {
                                "$eq": [
                                    "$_id",
                                    "$$foreignId"
                                ]
                            },
                            {
                              "$eq": [
                                "$_deleted",
                                false
                              ]
                            }
                        ]
                    }
                }
            }
        ],
        "as": "customer"
    },
    {
        "$unwind": {
            "path": "$customer",
            "preserveNullAndEmptyArrays": true
            }
    },
    {
        "$match": {
            "customer.name": 'some_search_string'
        }
    },
    {
        "$sort": {
            "customer.name": -1
        }
    },
    {
        "$limit": 35
    },
    {
        "$project": {
            "_id": 1,
            "customer._id": 1,
            "customer.name": 1,
            "description": 1,
            "end": 1,
            "start": 1,
            "title": 1
        }
    }
])

This query is getting incredibly slow when the collections are growing in size. With 1000 tasks and 20 customers it already takes about 500ms to deliver result.

I'm aware, that this happens because the $lookup operator has to do a tablescan for each row that enters the aggregation pipeline's lookup stage.

I have tried to set indexes like described here: Poor lookup aggregation performance but that doesn't seem to have any impact.

My next guess was that the "sub"-pipeline in the $lookup stage is not capable of using indexes, so I replaced it with a simple

"$lookup": {
    "from": "customer",
    "localField": "customerId",
    "foreignField": "_id",
    "as": "customer"
}

But still the indexes are not used or don't have any impact on performance. (To be honest I don't know which of both is the case since .explain() won't work with aggregation pipelines.)

I have tried the following indexes:

Ascending, desecending, hashed and text index on customerId
Ascending, desecending, hashed and text index on customer.name

I'm grateful for any ideas on what I'm doing wrong or how I could achive the same thing with a better aggregation pipeline.

Additional info: I'm using a three member replica set. I'm on MongoDB 4.0.

Please note: I'm aware that I'm using a non-relational database to achieve highly relational objectives, but in this project MongoDB was our choice due to it's ChangeStream feature. If anybody knows a different database with a comparable feature (realtime push notifications on changes), which can be run on-premise (so Firebase drops out), I would love to hear about it!

Thanks in advance!

Upvotes: 7

Answers (2)

Stefan Holzapfel

Reputation: 532

I found out why my indexes weren't used.

I queried the collection using a different collation than the collection's own collation. But the id indexes on a collection are always implemented using the collections default collation.

Therefore the indexes were not used.

I changed the collection's collation to the same as for the queries and now the query takes just a fraction of the time (but still slow :)).

(Yes you have to recreate the collections to change the collation, no on-the-fly change is possible.)

Upvotes: 3

Rupen B

Reputation: 76

Have you considered having a single collection for customer with tasks as an embedded array in each document? That way, you would be able to index search on both customer and task fields.

Upvotes: 0

Incredibly slow query performance with $lookup and &quot;sub&quot; aggregation pipeline

Answers (2)

Related Questions

Incredibly slow query performance with $lookup and "sub" aggregation pipeline