Jack
Jack

Reputation: 135

Remove duplicates by field based on secondary field

I have a use case where I am working with objects that appear as such:

{
  "data": {
    "uuid": 0001-1234-5678-9101
  },
  "organizationId": 10192432,
  "lastCheckin": 2022-03-19T08:23:02.435+00:00
}

Due to some old bugs in our application, we've accumulated many duplicates for these items in the database. The origin of the duplicates has been resolved in an upcoming release, but I need to ensure that prior to the release there are no such duplicates because the release includes a unique constraint on the "data.uuid" property.

I am trying to delete records based on the following criteria:

Any duplicate record based on "data.uuid" WHERE lastCheckin is NOT the most recent OR organizationId is missing.

Unfortunately, I am rather new to using MongoDB and do not know how to express this in a query. I have tried aggregated to obtain the duplicate records and, while I've been able to do so, I have so far been unable to exclude the records in each duplicate group containing the most recent "lastCheckin" value or even include "organizationId" as a part of the aggregation. Here's what I came up with:

db.collection.aggregate([
  { $match: {
    "_id": { "$ne": null },
    "count": { "$gt": 1 }
  }},
  { $group: {
    _id: "$data.uuid",
    "count": {
      "$sum": 1
    }
  }},
  { $project: {
    "uuid": "$_id", 
    "_id": 0
  }}
])

The above was mangled together based on various other stackoverflow posts describing the aggregation of duplicates. I am not sure whether this is the right way to approach this problem. One immediate problem that I can identify is that simply getting the "data.uuid" property without any additional criteria allowing me to identify the invalid duplicates makes it hard to envision a single query that can delete the invalid records without taking the valid records. Thanks for any help.

Upvotes: 0

Views: 173

Answers (1)

Charchit Kapoor
Charchit Kapoor

Reputation: 9284

I am not sure if this is possible via a single query, but this is how I would approach it, first sort the documents by lastCheckIn and then group the documents by data.uuid, like this:

db.collection.aggregate([
  {
    $sort: {
      lastCheckIn: -1
    }
  },
  {
    $group: {
      _id: "$data.uuid",
      "docs": {
        "$push": "$$ROOT"
      }
    }
  },
]);

Playground link.

Once you have these results, you can filter out the documents, according to your criteria, which you want to delete and collect their _id. The documents per group will be sorted by lastCheckIn in descending order, so filtering should be easy.

Finally, delete the documents, using this query:

db.collection.remove({_id: { $in: [\\ array of _ids collected above] }});

Upvotes: 1

Related Questions