Reputation: 31
I need to construct a query, using PyMongo, which gets data from two related collections in a MongoDB database.
Collection X has fields UserId, Name, and EmailId:
[
{
"UserId" : "941AB",
"Name" : "Alex Andresson",
"EmailId" : "[email protected]"
},
{
"UserId" : "768CD",
"Name" : "Bryan Barnes",
"EmailId" : "[email protected]"
}
]
Collection Y has fields UserId1, UserID2, and Rating:
[
{
"UserId1" : "941AB",
"UserId2" : "768CD",
"Rating" : 0.8
}
]
I need to print the name and email id of UserId1 and UserId2 and the rating, something like this:
[
{
"UserId1" : "941AB",
"UserName1" : "Alex Andresson"
"UserEmail1" : "[email protected]",
"UserId2" : "768CD",
"UserName2" : "Bryan Barnes"
"UserEmail2" : "[email protected]",
"Rating": 0.8
}
]
That means I need to fetch data from collection Y as well as the X one. I'm working with PyMongo right now and I have not been able to find its solution. Can somebody even give me a pseudocode on this concept or approach how to move forward with it.
Upvotes: 3
Views: 823
Reputation: 25191
You need to do the join manually or use some library that will do it for you - maybe mongoengine.
Basically you need to find the ratings you are interested in and then find the users that are related to those ratings.
Example:
#!/usr/bin/env python3
import pymongo
from random import randrange
client = pymongo.MongoClient()
db = client['test']
# clean collections
db['users'].drop()
db['ratings'].drop()
# insert data
user_count = 100
rating_count = 20
db['users'].insert_many([
{'UserId': i, 'Name': 'John', 'EmailId': i}
for i in range(user_count)])
db['ratings'].insert_many([
{'UserId1': randrange(user_count), 'UserId2': randrange(user_count), 'Rating': i}
for i in range(rating_count)])
# don't forget the indexes
db['users'].create_index('UserId')
# but it would be better if we used _id as the UserId
# if you want to make queries based on Rating value, then add also this index:
db['ratings'].create_index('Rating')
# now print ratings with users that have value 10+
# simple approach:
ratings = db['ratings'].find({'Rating': {'$gte': 10}})
for rating in ratings:
u1 = db['users'].find_one({'UserId': rating['UserId1']})
u2 = db['users'].find_one({'UserId': rating['UserId2']})
print('Rating between {} (UserId {:2}) and {} (UserId {:2}) is {:2}'.format(
u1['Name'], u1['UserId'], u2['Name'], u2['UserId'], rating['Rating']))
print('---')
# optimized approach:
ratings = list(db['ratings'].find({'Rating': {'$gte': 10}}))
user_ids = {r['UserId1'] for r in ratings}
user_ids |= {r['UserId2'] for r in ratings}
users = db['users'].find({'UserId': {'$in': list(user_ids)}})
users_by_id = {u['UserId']: u for u in users}
for rating in ratings:
u1 = users_by_id.get(rating['UserId1'])
u2 = users_by_id.get(rating['UserId2'])
print('Rating between {} (UserId {:2}) and {} (UserId {:2}) is {:2}'.format(
u1['Name'], u1['UserId'], u2['Name'], u2['UserId'], rating['Rating']))
Notice that the first approach calls one find
for ratings and two find
s per rating, but the second approach calls just three find
s in total. This would cause a huge performance difference if you are accessing MongoDB over the network.
I recommend to use _id
instead of UserId
if possible for the users collection.
Of course this particular use case would be much easier with SQL database. If you are using MongoDB for performance and you have much more reads than writes then consider caching related users Name into the rating document.
Upvotes: 0