Simon Lang
Simon Lang

Reputation: 42645

Dedupe MongoDB Collection

I'm new to NoSQL, so sorry if this is very basic. Let's say I have the following collection:

{
    a: 1,
    b: 2,
    c: 'x'
},
{
    a: 1,
    b: 2,
    c: 'y'
},
{
    a: 1,
    b: 1,
    c: 'y'
}

I would like to run a "Dedupe" query on anything that matches:

{
    a: 1,
    b: 2
    ... (any other properties are ignored) ...
},

So after the query is run, either of the following remaining in the collection would be fine:

{
    a: 1,
    b: 2,
    c: 'y'
},
{
    a: 1,
    b: 1,
    c: 'y'
}

OR

{
    a: 1,
    b: 2,
    c: 'x'
},
{
    a: 1,
    b: 1,
    c: 'y'
}

Just so long as there's only one document with a==1 and b==2 remaining.

Upvotes: 1

Views: 6912

Answers (3)

joshlsullivan
joshlsullivan

Reputation: 1500

This answer hasn't been updated in a while. It took me a while to figure this out. First, using the Mongo CLI, connect to the database and create an index on the field which you want to be unique. Here is an example for users with a unique email address:

db.users.createIndex({ "email": 1 }, { unique: true })

The 1 creates the index, along with the existing _id index automatically created.

Now when you run create or save on an object, if that email exists, Mongoose will through a duplication error.

Upvotes: 1

Sergio Tulentsev
Sergio Tulentsev

Reputation: 230316

I don't know of any commands that will update your collection in-place, but you can certainly do it via temp storage.

  1. Group your documents by your criteria (fields a and b)
  2. For each group pick any document from it. Save it to temp collection tmp. Discard the rest of the group.
  3. Overwrite original collection with documents from tmp.

You can do this with MapReduce or upcoming Aggregation Framework (currently in unstable branch).

I decided not to write code here as it would take the joy of learning away from you. :)

Upvotes: -1

dcrosta
dcrosta

Reputation: 26258

If you always want to ensure that only one document has any given a, b combination, you can use a unique index on a and b. When creating the index, you can give the dropDups option, which will remove all but one duplicate:

db.collection.ensureIndex({a: 1, b: 1}, {unique: true, dropDups: true})

Upvotes: 7

Related Questions