Using PyMongo, I need to fetch the fields of another collection

I need to construct a query, using PyMongo, which gets data from two related collections in a MongoDB database.

Collection X has fields UserId, Name, and EmailId:

[
  {
    "UserId" :    "941AB",
    "Name" :      "Alex Andresson",
    "EmailId" :   "[email protected]"
  },
  {
    "UserId" :    "768CD",
    "Name" :      "Bryan Barnes",
    "EmailId" :   "[email protected]"
  }
]

Collection Y has fields UserId1, UserID2, and Rating:

[
  {
    "UserId1" :  "941AB",
    "UserId2" :  "768CD",
    "Rating" :   0.8
   }
]

I need to print the name and email id of UserId1 and UserId2 and the rating, something like this:

[
  {
    "UserId1" :    "941AB",
    "UserName1" :  "Alex Andresson"
    "UserEmail1" : "[email protected]",
    "UserId2" :    "768CD",
    "UserName2" :  "Bryan Barnes"
    "UserEmail2" : "[email protected]",
    "Rating":      0.8
  }
]

That means I need to fetch data from collection Y as well as the X one. I'm working with PyMongo right now and I have not been able to find its solution. Can somebody even give me a pseudocode on this concept or approach how to move forward with it.

Upvotes: 3

Answers (1)

Messa

Reputation: 25191

You need to do the join manually or use some library that will do it for you - maybe mongoengine.

Basically you need to find the ratings you are interested in and then find the users that are related to those ratings.

Example:

#!/usr/bin/env python3

import pymongo
from random import randrange

client = pymongo.MongoClient()
db = client['test']

# clean collections
db['users'].drop()
db['ratings'].drop()

# insert data
user_count = 100
rating_count = 20

db['users'].insert_many([
    {'UserId': i, 'Name': 'John', 'EmailId': i}
    for i in range(user_count)])

db['ratings'].insert_many([
    {'UserId1': randrange(user_count), 'UserId2': randrange(user_count), 'Rating': i}
    for i in range(rating_count)])

# don't forget the indexes
db['users'].create_index('UserId')
# but it would be better if we used _id as the UserId

# if you want to make queries based on Rating value, then add also this index:
db['ratings'].create_index('Rating')

# now print ratings with users that have value 10+

# simple approach:
ratings = db['ratings'].find({'Rating': {'$gte': 10}})
for rating in ratings:
    u1 = db['users'].find_one({'UserId': rating['UserId1']})
    u2 = db['users'].find_one({'UserId': rating['UserId2']})
    print('Rating between {} (UserId {:2}) and {} (UserId {:2}) is {:2}'.format(
        u1['Name'], u1['UserId'], u2['Name'], u2['UserId'], rating['Rating']))

print('---')

# optimized approach:
ratings = list(db['ratings'].find({'Rating': {'$gte': 10}}))
user_ids = {r['UserId1'] for r in ratings}
user_ids |= {r['UserId2'] for r in ratings}
users = db['users'].find({'UserId': {'$in': list(user_ids)}})
users_by_id = {u['UserId']: u for u in users}
for rating in ratings:
    u1 = users_by_id.get(rating['UserId1'])
    u2 = users_by_id.get(rating['UserId2'])
    print('Rating between {} (UserId {:2}) and {} (UserId {:2}) is {:2}'.format(
        u1['Name'], u1['UserId'], u2['Name'], u2['UserId'], rating['Rating']))

Notice that the first approach calls one find for ratings and two finds per rating, but the second approach calls just three finds in total. This would cause a huge performance difference if you are accessing MongoDB over the network.

I recommend to use _id instead of UserId if possible for the users collection.

Of course this particular use case would be much easier with SQL database. If you are using MongoDB for performance and you have much more reads than writes then consider caching related users Name into the rating document.

Upvotes: 0

Using PyMongo, I need to fetch the fields of another collection

Answers (1)

Related Questions