Reputation: 14033
Below is my indexes,
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "spider.pages",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"url" : 1
},
"unique" : true,
"ns" : "spider.pages",
"name" : "url_1"
},
{
"v" : 1,
"key" : {
"parsed" : 1
},
"ns" : "spider.pages",
"name" : "parsed_1"
},
{
"v" : 1,
"key" : {
"date" : 1,
"category" : 1
},
"ns" : "spider.pages",
"name" : "date_1_category_1"
},
{
"v" : 1,
"key" : {
"indexed" : 1
},
"ns" : "spider.pages",
"name" : "indexed_1"
},
{
"v" : 1,
"key" : {
"link_extracted" : 1
},
"ns" : "spider.pages",
"name" : "link_extracted_1"
}
]
From my python command line interpreter I get the following,
>>> [item for item in pages.find({u'link_extracted': 0}, { u'_id':1}).sort(u'link_extracted', 1).limit(10)]
[{u'_id': ObjectId('53f988d820ba2709e89a1dc2')}, {u'_id': ObjectId('53f988d820ba270a1a9a1dbd')}, {u'_id': ObjectId('53f988e720ba2708fe9a1de4')}, {u'_id': ObjectId('53f994b620ba2706099a231e')}, {u'_id': ObjectId('53f988d820ba270bb49a1d10')}, {u'_id': ObjectId('53f994b720ba2706099a2320')}, {u'_id': ObjectId('53f9918720ba2708fe9a1fab')}, {u'_id': ObjectId('53f9949b20ba270bb49a215a')}, {u'_id': ObjectId('53f78ee420ba27220010098d')}, {u'_id': ObjectId('53f78ee620ba2721ed79d317')}]
>>> [item for item in pages.find({u'link_extracted': 0}, { u'_id':1}).sort(u'indexed', 1).limit(10)]
[{u'_id': ObjectId('53fb38c420ba27327b725aa9')}, {u'_id': ObjectId('53fb334d20ba2715f87265c2')}, {u'_id': ObjectId('53fb38f520ba2715f872674c')}, {u'_id': ObjectId('53fb38f520ba27327b725abe')}, {u'_id': ObjectId('53fb3eab20ba273348725c0c')}, {u'_id': ObjectId('53fafc1920ba27149b7257fa')}, {u'_id': ObjectId('53fafc1620ba27149b7257f7')}, {u'_id': ObjectId('53fafc1520ba27149b7257f6')}, {u'_id': ObjectId('53fb38f020ba2715f8726748')}, {u'_id': ObjectId('53fb38ef20ba2732d8725a9a')}]
>>> [item for item in pages.find({u'link_extracted': 0}, { u'_id':1}).sort(u'url', 1).limit(10)]
[{u'_id': ObjectId('53f848d920ba27319c4338ef')}, {u'_id': ObjectId('53f810e120ba27222952d374')}, {u'_id': ObjectId('53f810e120ba27222952d373')}, {u'_id': ObjectId('53f80bd220ba27222d52caef')}, {u'_id': ObjectId('53f80bd220ba27222d52caf0')}, {u'_id': ObjectId('53f823c220ba27222952d922')}, {u'_id': ObjectId('53f84c7720ba2731964338ff')}, {u'_id': ObjectId('53f911f620ba27458f434158')}, {u'_id': ObjectId('53f8163c20ba27222952d4cb')}, {u'_id': ObjectId('53f8162c20ba27222952d4c1')}]
But when I type the following I get an endless delay!
>>> [item for item in pages.find({u'link_extracted': 0}, { u'_id':1}).sort(u'date', 1).limit(10)] # Endless wait
I have 0.6 million of documents and each has date property.
Upvotes: 0
Views: 102
Reputation: 17960
The reason is that because you're using link_extracted
in your query, it is skipping the date
index and attempting to sort the result set based on date. This is slow when there are a large number of records.
You can solve this by using a composite index:
db.pages.ensureIndex({link_extracted: 1, date: 1})
As is always the case with indexes this will add some memory overhead, and it will add some processing overhead as the index is being created. You can see more details on MongoDB and index memory usage here: MongoDB index/RAM relationship
Upvotes: 1