Alex Bowyer

Reputation: 691

Why is data missing when processing large MongoDB collections in PyMongo? What can I do about it?

I'm having some issues working with a very large MongoDB collection (19 million documents).

When I simply iterate over the collection, as below, PyMongo seems to give up after 10,593,454 documents. The same thing happens even if I use skip() (a sketch of that variant follows the listing below); the latter half of the collection seems programmatically inaccessible.

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# iterate over the whole collection, keeping the cursor alive for the long scan
for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
    print("%s: created at %s" % (ii, classification["created_at"]))

print("Done.")

The script initially reports:

Collection contains 19036976 documents.

Eventually, the script completes, I get no errors, and I do get the "Done." message. But the last row printed is

10593454: created at 2013-12-12 02:17:35

All the records logged over roughly the last two years, i.e. the most recent ones, seem inaccessible. Does anyone have any idea what is going on here? What can I do about this?

Upvotes: 4

Views: 1117

Answers (1)

Alex Bowyer

Reputation: 691

OK, well, thanks to this helpful article I found another way to page through the documents, one which doesn't seem to be subject to this "missing data"/"timeout" issue. Essentially, you have to use find() and limit(), paging on the collection's _id ordering (sorted explicitly, so that the ordering the paging relies on is guaranteed) to retrieve the documents a batch at a time, and stopping once a page comes back empty. Here's my revised code:

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# get first ID
page_size = 100000
first_classification = classification_collection.find_one()
completed_page_rows = 1
last_id = first_classification["_id"]

# keep fetching pages, keyed on _id, until a page comes back empty
while True:
    # next page: project only created_at (plus _id, which comes back by default),
    # sorted explicitly on _id so the paging key is reliable
    page = classification_collection.find(
        {"_id": {"$gt": last_id}},
        {"created_at": 1},
        no_cursor_timeout=True,
    ).sort("_id", pymongo.ASCENDING).limit(page_size)

    rows_in_page = 0
    for classification in page:
        rows_in_page += 1
        completed_page_rows += 1
        if completed_page_rows % page_size == 0:
            print("%s (id = %s): created at %s" % (completed_page_rows, classification["_id"], classification["created_at"]))
        last_id = classification["_id"]

    if rows_in_page == 0:
        break

print("\nDone.\n")

I hope that writing up this solution will help others who hit this issue.

Note: this updated listing also takes on the suggestions from @Takarii and @adam-comerford in the comments: I now retrieve only the fields I need (_id comes back by default), and I also print out the IDs for reference.
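For anyone unfamiliar with the projection part, here's a minimal sketch of how the field selection behaves (same collection as above; the printed documents are illustrative):

# request only created_at; _id is returned by default unless suppressed
doc = classification_collection.find_one({}, {"created_at": 1})
print(doc)  # e.g. {'_id': ObjectId('...'), 'created_at': ...}

# suppress _id explicitly if it isn't wanted
doc = classification_collection.find_one({}, {"created_at": 1, "_id": 0})
print(doc)  # e.g. {'created_at': ...}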

Upvotes: 1
