Reputation: 321
I want to write a simple query that returns the user with the most followers who has the time zone "Brasilia" and has tweeted 100 or more times.
This is my pipeline:
pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
{"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
{"$sort":{"count":-1}} ]
I adapted it from a practice problem.
This was given as an example of the document structure:
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Can anybody spot my mistakes?
Upvotes: 3
Views: 4168
Reputation: 103375
Suppose you have a couple of sample documents as a minimal test case. Insert the test documents into a collection in the mongo shell:
db.collection.insert([
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"user" : {
"friends_count" : 145,
"statuses_count" : 457,
"screen_name" : "Catherinemull",
"time_zone" : "Brasilia",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong(22819398300)
},
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"user" : {
"friends_count" : 145,
"statuses_count" : 12334,
"time_zone" : "Brasilia",
"screen_name" : "marble",
"followers_count" : 2597,
"id" : 37486278
},
"id" : NumberLong(22819398301)
}])
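Your pipeline groups on user.followers_count and sorts by how many tweets share each follower count, so it never ranks users by their followers. To get the user with the most followers who is in the time zone "Brasilia" and has tweeted 100 or more times, the following pipeline achieves the desired result without using the $group operator: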
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gt":99
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marble',
u'tweets': 12334}]}
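As a rough illustration of what the four stages above do, here is a minimal sketch that mirrors them in plain Python on trimmed copies of the two sample documents. It does not talk to MongoDB; the function name top_user and its parameters are made up for this example, and the ObjectId values are replaced by plain strings so the snippet runs without pymongo.

```python
# Plain-Python mirror of $match -> $project -> $sort -> $limit.
# Assumption: ObjectIds are simplified to strings; only the fields the
# pipeline touches are kept from the sample documents.
docs = [
    {"_id": "5304e2e3cc9e684aa98bef97",
     "user": {"statuses_count": 457, "screen_name": "Catherinemull",
              "time_zone": "Brasilia", "followers_count": 169, "id": 37486277}},
    {"_id": "52fd2490bac3fa1975477702",
     "user": {"statuses_count": 12334, "screen_name": "marble",
              "time_zone": "Brasilia", "followers_count": 2597, "id": 37486278}},
]


def top_user(documents, time_zone="Brasilia", min_tweets=100):
    """Hypothetical helper: return a one-element list with the top user."""
    # $match: keep tweets whose user is in the time zone and has enough tweets.
    matched = [
        d for d in documents
        if d["user"].get("time_zone") == time_zone
        and (d["user"].get("statuses_count") or 0) >= min_tweets
    ]
    # $project: pull the nested user fields up to the top level.
    projected = [
        {"_id": d["_id"],
         "followers": d["user"]["followers_count"],
         "screen_name": d["user"]["screen_name"],
         "tweets": d["user"]["statuses_count"]}
        for d in matched
    ]
    # $sort descending on followers, then $limit 1.
    ranked = sorted(projected, key=lambda d: d["followers"], reverse=True)
    return ranked[:1]


result = top_user(docs)
print(result)
```

The sort key is the follower count itself, which is the difference from the pipeline in the question, where the sort key was a count of documents.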
The following aggregation pipeline will also give you the desired result. Its first stage is the $match operator, which keeps only those documents where the user's time_zone field has the value "Brasilia" and the tweet count (the statuses_count field) is greater than or equal to 100, matched via the $gte comparison operator.
The second stage is the $group operator, which groups the filtered documents by the identifier expression $user.id and applies the $max accumulator to the $user.followers_count field to get the greatest number of followers for each user. The system variable $$ROOT, which references the top-level document currently being processed in the $group stage, is added to an extra array field, data, for use later on. This is done with the $addToSet accumulator.
The next stage, $unwind, outputs one document for each element of the data array for processing in the next step.
The following stage, $project, then reshapes each document in the stream, adding new fields whose values come from the previous stage.
The last two stages, $sort and $limit, reorder the document stream by the sort key followers in descending order and return a single document: the user with the highest number of followers.
Your final aggregation pipeline should thus look like this:
db.collection.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
Executing this in Robomongo gives the following result:
/* 0 */
{
"result" : [
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"followers" : 2597,
"screen_name" : "marble",
"tweets" : 12334
}
],
"ok" : 1
}
In Python, the implementation should be essentially the same:
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
... {"$unwind": "$data"},
... {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
... {"$sort": { "followers": -1 }},
... {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
... print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>
where
pipeline = [
{"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
{"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
{"$unwind": "$data"},
{"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
{"$sort": { "followers": -1 }},
{"$limit" : 1}
]
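To see what the $group and $unwind stages contribute, the sketch below mirrors them in plain Python. It is an illustration only, not how MongoDB evaluates the pipeline. The helper names group_by_user and unwind_and_project are invented for this example, ObjectIds are simplified to strings, and the third tweet is hypothetical data added so that one user has more than one document in its group.

```python
# Plain-Python mirror of $group ($max + $addToSet) -> $unwind -> $project.
# Assumption: the input has already passed the $match stage.
matched = [
    {"_id": "5304e2e3cc9e684aa98bef97",
     "user": {"id": 37486277, "screen_name": "Catherinemull",
              "statuses_count": 457, "followers_count": 169}},
    {"_id": "52fd2490bac3fa1975477702",
     "user": {"id": 37486278, "screen_name": "marble",
              "statuses_count": 12334, "followers_count": 2597}},
    # Hypothetical later tweet by the same user, with a newer follower count.
    {"_id": "hypothetical-second-tweet",
     "user": {"id": 37486278, "screen_name": "marble",
              "statuses_count": 12335, "followers_count": 2600}},
]


def group_by_user(documents):
    """Mirror of $group: one entry per user id, with max followers and data."""
    groups = {}
    for d in documents:
        followers = d["user"]["followers_count"]
        g = groups.setdefault(d["user"]["id"],
                              {"max_followers": followers, "data": []})
        g["max_followers"] = max(g["max_followers"], followers)  # $max
        if d not in g["data"]:  # $addToSet keeps only distinct documents
            g["data"].append(d)
    return groups


def unwind_and_project(groups):
    """Mirror of $unwind on data followed by $project."""
    return [
        {"_id": d["_id"],
         "followers": g["max_followers"],
         "screen_name": d["user"]["screen_name"],
         "tweets": d["user"]["statuses_count"]}
        for g in groups.values()
        for d in g["data"]
    ]


groups = group_by_user(matched)
rows = unwind_and_project(groups)
# $sort descending on followers, then $limit 1.
top = sorted(rows, key=lambda r: r["followers"], reverse=True)[:1]
print(top)
```

After the unwind step, every tweet of a user carries that user's maximum follower count, so tweets by the same user tie on the sort key. Which of the tied documents is returned after $limit is not something to rely on.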
Upvotes: 5