Extria
Extria

Reputation: 373

Iterating through collections and counting the number of same value appearances pymongo

I have similar data in 5 collections in mongodb as follows (documents)

{ 
   "_id" : ObjectId("53490030cf3b942d63cfbc7b"),
   "uNr" : "abdc123abcd",  
 }

I want to iterate through each collection and check if there is uNr match in any collection. If there is then add that uNr and count +1 to new table. For example if there is a match in 3 collections, that it should show {"uNr" : "abcd123", "count": "3"}

Upvotes: 0

Views: 387

Answers (1)

A. Jesse Jiryu Davis
A. Jesse Jiryu Davis

Reputation: 24009

If your total number of uNr values is small enough to fit in memory (at most a few million of them), you can total them client-side with a Counter and store them in a MongoDB collection:

from collections import Counter
from pymongo import MongoClient, InsertOne

db = MongoClient().my_database

counts = Counter()

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    for doc in collection.find():
        counts[doc['uNr']] += 1

# Empty the target collection.
db.counts.delete_many({})

inserts = [InsertOne({'_id': uNr, 'n': cnt}) for uNr, cnt in counts.items()]
db.counts.bulk_write(inserts)

Otherwise, query a thousand uNr values at a time and update counts in a separate collection:

from pymongo import MongoClient, UpdateOne, ASCENDING

db = MongoClient().my_database

# Empty the target collection.
db.counts.delete_many({})
db.counts.create_index([('uNr', ASCENDING)])

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    cursor = collection.find(no_cursor_timeout=True)
    # "with" statement helps ensure cursor is closed, since the server will
    # never auto-close it.
    with cursor:
        updates = []
        for doc in cursor:
            updates.append(UpdateOne({'_id': doc['uNr']},
                                     {'$inc': {'n': 1}},
                                     upsert=True))

            if len(updates) == 1000:
                db.counts.bulk_write(updates)
                updates = []

        if updates:
            # Last batch.
            db.counts.bulk_write(updates)

Upvotes: 1

Related Questions