Dewsworld

Reputation: 14063

How to optimize find by date query in mongo

I have a collection with about 600,000 documents. Most of the documents are structured like this:

{
    "_id" : ObjectId("53d86ef920ba274d5e4c8683"),
    "checksum" : "2856caa9490e5c92aedde91330964488",
    "content" : "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"bn-bd\" lang=\"bn-bd\" dir=\"ltr\" " />\n  <link rel=\"stylesheet\" href=\"/templates/beez_20/css/position.css\" type=\"text/css\" media=\"screen,projection\ef=\"/index.php/bn/contact-bangla/2013-0</body>\r\n</html>",
    "date" : ISODate("2014-07-29T15:57:11.886Z"),
    "filtered_content" : "",
    "indexed" : true,
    "category" : "raw",
    "link_extracted" : 1,
    "parsed" : true,
    "title" : "Constituency 249_10th_En",
    "url" : "http://www.somesite.com.bd/index.php/bn/bangla/2014-03-23-11-45-04?layout=edit&id=2143"
}

All the documents have the date attribute. When I run the query below, it takes an indefinitely long time to return results.

from pymongo import MongoClient
import datetime

# Connect to the local MongoDB server
con = MongoClient()
db = con.spider
pages = db.pages

# Midnight at the start of today
today = datetime.datetime.combine(datetime.date.today(), datetime.time.min)

# Titles of 'news' pages dated after midnight today
c = pages.find({'category': 'news', 'date': {'$gt': today}}, {'title': 1, '_id': 0})

for item in c:
    print(item)

Indexes are,

_id, url, parsed

How can I improve the performance of this query so that it completes in an acceptable amount of time? Any solid answers or suggestions are appreciated!

Upvotes: 1

Views: 3127

Answers (1)

hughdbrown

Reputation: 49033

It looks like adding an index on category and date would help.

db.pages.createIndex({'date': 1, 'category': 1});

In pymongo, the index creation would look more like this:

import pymongo

# Compound index: date first, then category, both ascending
keys = [
    ("date", pymongo.ASCENDING),
    ("category", pymongo.ASCENDING)
]
pages.create_index(keys)
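
To check whether the new index is actually used, one option is to ask the server for the query plan with the cursor's explain() method. This is a minimal sketch, assuming `pages` is the pymongo collection from the question; the helper names `todays_news_query` and `explain_todays_news` are made up for illustration.

```python
import datetime


def todays_news_query(now=None):
    """Build the filter and projection used in the question."""
    now = now or datetime.datetime.now()
    # Midnight at the start of the given day
    midnight = datetime.datetime.combine(now.date(), datetime.time.min)
    query = {"category": "news", "date": {"$gt": midnight}}
    projection = {"title": 1, "_id": 0}
    return query, projection


def explain_todays_news(pages):
    """Return the server's query plan for the question's query.

    `pages` is assumed to be a pymongo Collection (hypothetical helper).
    """
    query, projection = todays_news_query()
    return pages.find(query, projection).explain()
```

The shape of the returned plan depends on the server version. On MongoDB 3.0 and later, an IXSCAN stage suggests the index is being used, while COLLSCAN means the whole collection is still being scanned.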

The create_index options you are most likely to be interested in are:

name: custom name to use for this index; if none is given, a name will be generated
unique: if True, creates a unique constraint on the index

I don't expect that date/category would be unique, though. Giving the index a name seems like good practice.
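
Passing a name could look like the sketch below. The index name `date_category_idx` and the helper function are hypothetical, and `pages` is assumed to be the pymongo collection from the question. The direction value 1 is the same as pymongo.ASCENDING.

```python
def create_named_date_category_index(pages, name="date_category_idx"):
    """Create the compound index from the answer under an explicit name.

    `pages` is assumed to be a pymongo Collection; the default name is
    only an illustration. create_index returns the name of the index.
    """
    keys = [("date", 1), ("category", 1)]  # 1 == pymongo.ASCENDING
    return pages.create_index(keys, name=name)
```

A variation that may be worth comparing, though it is not part of the answer above: MongoDB's indexing guidance generally places equality-matched fields before range-matched fields, which for this query would mean listing category before date.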

Upvotes: 5
