QA Intern

Reputation: 51

Check for and remove duplicates in MongoDB with Python

I want to remove duplicate data from my collection in MongoDB. How can I accomplish this?

Please refer to this example to understand my problem:

My collection stores questions as documents like the following:

{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1234ffa7085"),
"userId" : "102"
},

{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"userId" : "102"
}

How do I remove the duplicate question by the same userId? Any help?

I'm using Python and MongoDB.

Upvotes: 1

Views: 5473

Answers (2)

Mouhcine Toumi

Reputation: 1

Since the dropDups option has been removed (as of MongoDB 3.x), you can use pandas instead:

  1. Select the fields you need from MongoDB.
  2. Use pandas.DataFrame.duplicated to mark every duplicate as True except the first occurrence.
  3. Delete the documents marked as duplicates from the collection, using their _id values.
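
The three steps above could be sketched roughly as follows. This is only an illustration, not tested against a live server: the helper names `duplicate_ids` and `remove_duplicates` are made up for this example, `collection` is assumed to be a pymongo collection, and every document is assumed to contain the fields listed in `keys`.

```python
import pandas as pd


def duplicate_ids(docs, keys):
    """Return the _id of every document that repeats an earlier one on `keys`."""
    df = pd.DataFrame(list(docs))
    if df.empty:
        return []
    # keep="first" leaves the first occurrence False and marks every repeat True
    mask = df.duplicated(subset=list(keys), keep="first")
    return df.loc[mask, "_id"].tolist()


def remove_duplicates(collection, keys):
    """Delete the duplicates from a pymongo collection; return how many were removed."""
    # Step 1: fetch only the fields needed (_id is included by default)
    projection = {key: 1 for key in keys}
    # Step 2: let pandas mark the duplicates
    ids = duplicate_ids(collection.find({}, projection), keys)
    if not ids:
        return 0
    # Step 3: delete the marked documents by _id
    return collection.delete_many({"_id": {"$in": ids}}).deleted_count
```

For the collection in the question, the call would be something like `remove_duplicates(db.questions, ["questionText", "userId"])`, where `db.questions` stands in for whatever the collection is actually called. Note that this loads the selected fields of the whole collection into memory, so it suits small to medium collections.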

Upvotes: 0

mishri

Reputation: 66

IMPORTANT: The dropDups option was removed starting with MongoDB 3.x, so this solution is only valid for MongoDB versions 2.x and before. There is no direct replacement for the dropDups option. The answers to the question posed at http://stackoverflow.com/questions/30187688/mongo-3-duplicates-on-unique-index-dropdups offer some possible alternative ways to remove duplicates in Mongo 3.x.

Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option.

Assuming the collection includes a field named record_id that uniquely identifies a record in the collection, the command to use to create a unique index and drop duplicates is:

db.collection.ensureIndex( { record_id:1 }, { unique:true, dropDups:true } )

Here is the trace of a session that shows the contents of a collection before and after creating a unique index with dropDups. Notice that duplicate records are no longer present after the index is created.

> db.pages.find()
{ "_id" : ObjectId("52829c886602e2c8428d1d8c"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8d"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8e"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8f"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
>
> db.pages.ensureIndex( { scan_id:1, leaf_num:1 }, { unique:true, dropDups:true } )
>
> db.pages.find()
{ "_id" : ObjectId("52829c886602e2c8428d1d8c"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8e"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
>
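
For MongoDB 3.x and later, where dropDups is not available, one possible alternative from Python is to group the documents with an aggregation pipeline and delete all but one `_id` per group. The following is a minimal sketch under some assumptions, not necessarily the approach taken in the linked answers: the function names are hypothetical, `collection` is assumed to be a pymongo collection, and `keys` is assumed to hold top-level field names.

```python
def duplicate_groups_pipeline(keys):
    """Build a pipeline that yields one result per set of documents sharing `keys`."""
    return [
        {"$group": {
            "_id": {key: "$" + key for key in keys},  # group on the chosen fields
            "ids": {"$push": "$_id"},                 # collect each member's _id
            "count": {"$sum": 1},
        }},
        {"$match": {"count": {"$gt": 1}}},            # keep only groups with repeats
    ]


def drop_duplicate_groups(collection, keys):
    """Delete all but one document per duplicate group; return how many were removed."""
    removed = 0
    pipeline = duplicate_groups_pipeline(keys)
    for group in collection.aggregate(pipeline, allowDiskUse=True):
        # Keep the first _id the pipeline lists; which document that is
        # is not guaranteed unless a $sort stage is added before $group.
        extra_ids = group["ids"][1:]
        removed += collection.delete_many({"_id": {"$in": extra_ids}}).deleted_count
    return removed
```

For the pages collection in the trace above, the call would look like `drop_duplicate_groups(db.pages, ["scan_id", "leaf_num"])`. Once the duplicates are gone, a plain unique index on the same fields can be created to stop new ones from being inserted.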

Upvotes: 3
