Branchverse
Branchverse

Reputation: 1397

Mongoose reverse lookup and delete

I am trying to make a healthcheck on references in one of my collections. so to see if objects referenced to still exist and if not I want to delete that _id in the array

I haven't found anything to that so my idea is to get the reversed result of a $lookup

Is it possible to get the reversed result of a lookup in MongoDB?

Here is an example of a collection and its taskList with references to the tasks collection.

Now I want to delete all the id's in there that do not have an existing result in the tasks collection.

How I solve it right now which is tons of queries:

  1. get all the ids from taskList
  2. Send a query for every single one of them to see if there is no match with the task collection
  3. Send a query to pull that empty reference out of the array

Upvotes: 1

Views: 471

Answers (3)

Takis
Takis

Reputation: 8695

I think this does what you want, its ok even if you have big collections.

But its not an update you can do after that a $merge stage, to the tasklists (if match on _id replace)(requires MongoDB >= 4.4) or you can do a $out stage to another collection, and replace the tasklist collection.

Test code here

Data in

db={
  "tasklists": [
    {
      "_id": 1,
      "tasklist": [
        1,
        2,
        3,
        4
      ]
    },
    {
      "_id": 2,
      "tasklist": [
        5,
        6,
        7
      ]
    }
  ],
  "tasks": [
    {
      "_id": 1
    },
    {
      "_id": 2
    },
    {
      "_id": 3
    },
    {
      "_id": 5
    }
  ]
}
db.tasklists.aggregate([
  {
    "$lookup": {
      "from": "tasks",
      "let": {
        "tasklist": "$tasklist"
      },
      "pipeline": [
        {
          "$match": {
            "$expr": {
              "$in": [
                "$_id",
                "$$tasklist"
              ]
            }
          }
        }
      ],
      "as": "valid"
    }
  },
  {
    "$addFields": {
      "valid": {
        "$map": {
          "input": "$valid",
          "as": "v",
          "in": "$$v._id"
        }
      }
    }
  },
  {
    "$addFields": {
      "tasklist": {
        "$filter": {
          "input": "$tasklist",
          "as": "t",
          "cond": {
            "$in": [
              "$$t",
              "$valid"
            ]
          }
        }
      }
    }
  },
  {
    "$unset": [
      "valid"
    ]
  }
])

Results (tasks 4,6,7 wasnt found in the task collection,and removed)

[
  {
    "_id": 1,
    "tasklist": [
      1,
      2,
      3
    ]
  },
  {
    "_id": 2,
    "tasklist": [
      5
    ]
  }
]

Edit

If you want to use index to do the $lookup you can try this

Test code here

Tasks have index on _id so no need to make one, if you dont join on _id make one.

db.tasklists.aggregate([
  {
    "$unwind": {
      "path": "$tasklist"
    }
  },
  {
    "$lookup": {
      "from": "tasks",
      "localField": "tasklist",
      "foreignField": "_id",
      "as": "joined"
    }
  },
  {
    "$match": {
      "$expr": {
        "$gt": [
          {
            "$size": "$joined"
          },
          0
        ]
      }
    }
  },
  {
    "$unset": [
      "joined"
    ]
  },
  {
    "$group": {
      "_id": "$_id",
      "tasklist": {
        "$push": "$tasklist"
      },
      "afield": {
        "$first": "$afield"
      }
    }
  }
])

After that you can do $out or $merge with replace option. But both lose the updated data if any while this was happening.

Only solution for this(if it is a problem) $merge with pipeline, You need to keep also in the pipeline above an extra array with the initial tasklist, so you remove the valid ones, to have the invalid ones, and then on merge with pipeline to filter the array, and just removed those invalid. (this is safe, from data loss)

I think the best approach instead of doing all those is to have an index on tasklist(multikey index) and when an _id is deleted from tasks,to delete the _id from the array in tasklist.With index it will be fast, so you dont need to check for invalid _ids.

Upvotes: 1

Branchverse
Branchverse

Reputation: 1397

So what I came to do now is a few stepped method. This is quite fast but sicne the taskIds collected from Sets are currently way smaller than the entire amount of sets I imagine working with the $setDifference operator mentioned by eol will be faster once I get that many references.

let taskIdsInSets = []
    // Get all referenced task ids
    const result = await this.setSchema.aggregate([
      {
        '$project': {
          'taskList': 1
        }
      }
    ])

    // Map all elements in one row
    result.forEach(set => taskIdsInSets.push(...set.taskList.map(x=> x.toString())))

    // Delete duplicates of taskIds here
    taskIdsInSets.filter((item, index) => taskIdsInSets.indexOf(item) != index)

    // Get the existing task ids that are referenced in a Set
    const result2 = await this.taskSchema.aggregate([
      {
        '$match': {
          '_id': {
            '$in': [...taskIdsInSets.map(x => Types.ObjectId(x.toString()))]
          }
        }
      }, {
        '$project': {
          '_id': 1
        }
      }
    ])

    let existingIdsInTasks = []
    // Getting ids from result2 Object into
    result2.forEach(set => existingIdsInTasks.push(set._id.toString()))

    // Filtering out the ids that don't actually exist
    let nonExistingTaskIds = taskIdsInSets.filter(x => existingIdsInTasks.indexOf(x) === -1);

    // Deleting the ids that don't actually exist but are in Sets
    const finalResult = await this.setSchema.updateMany(
      {
        $pullAll: {
          taskList: [...nonExistingTaskIds.map(x => Types.ObjectId(x.toString()))]
        }
      })
    console.log(finalResult)
    return finalResult // returns the information how much got changed. unfortunately in mongoose there isn't the option to use findAndModify with `{new:true}` or atleast I didn't manage to make it work.

for some reason what the database returns neither matches the Mongo ObjectId nor strings so I have to do some castings there.

Upvotes: 0

eol
eol

Reputation: 24565

Afaik there's no other way than you described in order to achieve the desired outcome, but you can greatly simplify the second step to find the non-matching items. In fact it's the set difference between the taskList-ids and the existing task-ids.

So you could use the $setDifference-operator to calculate that difference:

db.tasks.aggregate([
  {
    $group: {
      _id: "null",
      ids: {
        "$addToSet": "$_id"
      }
    }
  },
  {
    $project: {
      nonMatchingTaskIds: {
        $setDifference: [
          [
            "taskId1",
            "taskId2",
            "taskId7",
            "taskId8"
          ],
          "$ids"
        ]
      }
    }
  }
])

Assuming your tasks collection contains taskId1, task2 (and other documents), but not taskId7 and taskId8, the query will result in nonMatchingTaskIds containing taskId7 and taskId8.

Here's an example on mongoplayground: https://mongoplayground.net/p/75BpiGBJi3Q

Upvotes: 1

Related Questions