Jan Tamm
Jan Tamm

Reputation: 117

Python-How to find duplicated name/document in mongo db?

I want to find the duplicated document in my mongodb based on name, I have the following code:

def Check_BFA_DB(options):
    issue_list=[]
    client = MongoClient(options.host, int(options.port))
        db = client[options.db]
        collection = db[options.collection]
        names = [{'$project': {'name':'$name'}}]
        name_cursor = collection.aggregate(names, cursor={})
        for name in name_cursor:
            issue_list.append(name)
            print(name)

It will print all names, how can I print only the duplicated ones?

Appritiated for any help!

Upvotes: 1

Views: 1308

Answers (1)

Hagai
Hagai

Reputation: 703

The following query will show only duplicates:

db['collection_name'].aggregate([{'$group': {'_id':'$name', 'count': {'$sum': 1}}}, {'$match': {'count': {'$gt': 1}}}])

How it works:

Step 1: Go over the whole collection, and group the documents by the property called name, and for each name count how many times it is used in the collection.

Step 2: filter (using the keyword match) only documents in which the count is greater than 1 (the gt operator).

An example (written for mongo shell, but can be easily adapted for python):

db.a.insert({name: "name1"})
db.a.insert({name: "name1"})
db.a.insert({name: "name2"})
db.a.aggregate([{"$group": {_id:"$name", count: {"$sum": 1}}}, {$match: {count: {"$gt": 1}}}])

Result is { "_id" : "name1", "count" : 2 }

So your code should look something like this:

def Check_BFA_DB(options):
    issue_list=[]
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    name_cursor = db[options.collection].aggregate([
        {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
        {'$match': {'count': {'$gt': 1}}}
        ])

    for document in name_cursor:
        name = document['_id']
        issue_list.append(name)
        print(name)

BTW (not related to the question), python naming convention for function names is lowercase letters, so you might want to call it check_bfa_db()

Upvotes: 2

Related Questions