Reputation: 1397
I am trying to make a healthcheck on references in one of my collections
. so to see if objects referenced to still exist and if not I want to delete that _id
in the array
I haven't found anything to that so my idea is to get the reversed
result of a $lookup
Is it possible to get the reversed result of a lookup in MongoDB?
Here is an example of a collection and its taskList
with references to the tasks
collection.
Now I want to delete all the id's in there that do not have an existing result in the tasks
collection.
How I solve it right now which is tons of queries:
taskList
Upvotes: 1
Views: 471
Reputation: 8695
I think this does what you want, its ok even if you have big collections.
But its not an update you can do after that a $merge
stage, to the tasklists (if match on _id replace)(requires MongoDB >= 4.4) or you can do a $out
stage to another collection, and replace the tasklist collection.
Data in
db={
"tasklists": [
{
"_id": 1,
"tasklist": [
1,
2,
3,
4
]
},
{
"_id": 2,
"tasklist": [
5,
6,
7
]
}
],
"tasks": [
{
"_id": 1
},
{
"_id": 2
},
{
"_id": 3
},
{
"_id": 5
}
]
}
db.tasklists.aggregate([
{
"$lookup": {
"from": "tasks",
"let": {
"tasklist": "$tasklist"
},
"pipeline": [
{
"$match": {
"$expr": {
"$in": [
"$_id",
"$$tasklist"
]
}
}
}
],
"as": "valid"
}
},
{
"$addFields": {
"valid": {
"$map": {
"input": "$valid",
"as": "v",
"in": "$$v._id"
}
}
}
},
{
"$addFields": {
"tasklist": {
"$filter": {
"input": "$tasklist",
"as": "t",
"cond": {
"$in": [
"$$t",
"$valid"
]
}
}
}
}
},
{
"$unset": [
"valid"
]
}
])
Results (tasks 4,6,7 wasnt found in the task collection,and removed)
[
{
"_id": 1,
"tasklist": [
1,
2,
3
]
},
{
"_id": 2,
"tasklist": [
5
]
}
]
Edit
If you want to use index to do the $lookup
you can try this
Tasks have index on _id
so no need to make one, if you dont join on _id
make one.
db.tasklists.aggregate([
{
"$unwind": {
"path": "$tasklist"
}
},
{
"$lookup": {
"from": "tasks",
"localField": "tasklist",
"foreignField": "_id",
"as": "joined"
}
},
{
"$match": {
"$expr": {
"$gt": [
{
"$size": "$joined"
},
0
]
}
}
},
{
"$unset": [
"joined"
]
},
{
"$group": {
"_id": "$_id",
"tasklist": {
"$push": "$tasklist"
},
"afield": {
"$first": "$afield"
}
}
}
])
After that you can do $out
or $merge
with replace option.
But both lose the updated data if any while this was happening.
Only solution for this(if it is a problem) $merge
with pipeline,
You need to keep also in the pipeline above an extra array with the initial tasklist, so you remove the valid ones, to have the invalid ones, and then on merge with pipeline to filter the array, and just removed those invalid. (this is safe, from data loss)
I think the best approach instead of doing all those is to have an index on tasklist(multikey index) and when an _id is deleted from tasks,to delete the _id from the array in tasklist.With index it will be fast, so you dont need to check for invalid _ids.
Upvotes: 1
Reputation: 1397
So what I came to do now is a few stepped method.
This is quite fast but sicne the taskIds collected from Sets are currently way smaller than the entire amount of sets I imagine working with the $setDifference
operator mentioned by eol will be faster once I get that many references.
let taskIdsInSets = []
// Get all referenced task ids
const result = await this.setSchema.aggregate([
{
'$project': {
'taskList': 1
}
}
])
// Map all elements in one row
result.forEach(set => taskIdsInSets.push(...set.taskList.map(x=> x.toString())))
// Delete duplicates of taskIds here
taskIdsInSets.filter((item, index) => taskIdsInSets.indexOf(item) != index)
// Get the existing task ids that are referenced in a Set
const result2 = await this.taskSchema.aggregate([
{
'$match': {
'_id': {
'$in': [...taskIdsInSets.map(x => Types.ObjectId(x.toString()))]
}
}
}, {
'$project': {
'_id': 1
}
}
])
let existingIdsInTasks = []
// Getting ids from result2 Object into
result2.forEach(set => existingIdsInTasks.push(set._id.toString()))
// Filtering out the ids that don't actually exist
let nonExistingTaskIds = taskIdsInSets.filter(x => existingIdsInTasks.indexOf(x) === -1);
// Deleting the ids that don't actually exist but are in Sets
const finalResult = await this.setSchema.updateMany(
{
$pullAll: {
taskList: [...nonExistingTaskIds.map(x => Types.ObjectId(x.toString()))]
}
})
console.log(finalResult)
return finalResult // returns the information how much got changed. unfortunately in mongoose there isn't the option to use findAndModify with `{new:true}` or atleast I didn't manage to make it work.
for some reason what the database returns neither matches the Mongo ObjectId nor strings so I have to do some castings there.
Upvotes: 0
Reputation: 24565
Afaik there's no other way than you described in order to achieve the desired outcome, but you can greatly simplify the second step to find the non-matching items. In fact it's the set difference between the taskList
-ids and the existing task
-ids.
So you could use the $setDifference
-operator to calculate that difference:
db.tasks.aggregate([
{
$group: {
_id: "null",
ids: {
"$addToSet": "$_id"
}
}
},
{
$project: {
nonMatchingTaskIds: {
$setDifference: [
[
"taskId1",
"taskId2",
"taskId7",
"taskId8"
],
"$ids"
]
}
}
}
])
Assuming your tasks
collection contains taskId1
, task2
(and other documents), but not taskId7
and taskId8
, the query will result in nonMatchingTaskIds
containing taskId7
and taskId8
.
Here's an example on mongoplayground: https://mongoplayground.net/p/75BpiGBJi3Q
Upvotes: 1