Reputation: 51
I want to remove duplicate data from my collection in MongoDB. How can I accomplish this?
Please refer to this example to understand my problem:
My collection name & questions are in this col/row as follows -
{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1234ffa7085"),
"userId" : "102"
},
{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"userId" : "102"
}
How do I remove the duplicate question by the same userId? Any help?
I'm using Python and MongoDB.
Upvotes: 1
Views: 5473
Reputation: 1
Since now the dropOps is deprecated. You can use pandas.
Upvotes: 0
Reputation: 66
IMPORTANT: The dropDups option was removed starting with MongoDB 3.x, so this solution is only valid for MongoDB versions 2.x and before. There is no direct replacement for the dropDups option. The answers to the question posed at http://stackoverflow.com/questions/30187688/mongo-3-duplicates-on-unique-index-dropdups offer some possible alternative ways to remove duplicates in Mongo 3.x.
Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option.
Assuming the collection includes a field named record_id that uniquely identifies a record in the collection, the command to use to create a unique index and drop duplicates is:
db.collection.ensureIndex( { record_id:1 }, { unique:true, dropDups:true } )
Here is the trace of a session that shows the contents of a collection before and after creating a unique index with dropDups. Notice that duplicate records are no longer present after the index is created.
> db.pages.find()
{ “_id” : ObjectId(“52829c886602e2c8428d1d8c”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8d”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8e”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8f”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
>
> db.pages.ensureIndex( { scan_id:1, leaf_num:1 }, { unique:true, dropDups:true } )
>
> db.pages.find()
{ “_id” : ObjectId(“52829c886602e2c8428d1d8c”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8e”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
>
Upvotes: 3