Alex Bowyer

Reputation: 691

Why is data missing when processing large MongoDB collections in PyMongo? What can I do about it?

I'm having some issues working with a very large MongoDB collection (19 million documents).

When I simply iterate over the collection, as below, PyMongo seems to give up after 10,593,454 documents. The same thing happens even if I use skip() (a sketch of that variant follows the listing below); the latter half of the collection seems programmatically inaccessible.

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# iterate over the whole collection, keeping the cursor alive for the long scan
for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
    print("%s: created at %s" % (ii, classification["created_at"]))

print("Done.")

The script initially reports:

Collection contains 19036976 documents.

Eventually, the script completes, I get no errors, and I do get the "Done." message. But the last row printed is

10593454: created at 2013-12-12 02:17:35

All the records logged over roughly the last two years, i.e. the most recent ones, seem inaccessible. Does anyone have any idea what is going on here? What can I do about this?

Upvotes: 4

Views: 1117

Answers (1)

Alex Bowyer

Reputation: 691

OK, well, thanks to this helpful article I found another way to page through the documents, one which doesn't seem to be subject to this "missing data"/"timeout" issue. Essentially, you have to use find() and limit(), paging on the collection's _id ordering (sorted explicitly, so that the ordering the paging relies on is guaranteed) to retrieve the documents a batch at a time, and stopping once a page comes back empty. Here's my revised code:

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# get first ID
page_size = 100000
first_classification = classification_collection.find_one()
completed_page_rows = 1
last_id = first_classification["_id"]

# keep fetching pages, keyed on _id, until a page comes back empty
while True:
    # next page: project only created_at (plus _id, which comes back by default),
    # sorted explicitly on _id so the paging key is reliable
    page = classification_collection.find(
        {"_id": {"$gt": last_id}},
        {"created_at": 1},
        no_cursor_timeout=True,
    ).sort("_id", pymongo.ASCENDING).limit(page_size)

    rows_in_page = 0
    for classification in page:
        rows_in_page += 1
        completed_page_rows += 1
        if completed_page_rows % page_size == 0:
            print("%s (id = %s): created at %s" % (completed_page_rows, classification["_id"], classification["created_at"]))
        last_id = classification["_id"]

    if rows_in_page == 0:
        break

print("\nDone.\n")

I hope that writing up this solution will help others who hit this issue.

Note: this updated listing also takes on the suggestions from @Takarii and @adam-comerford in the comments: I now retrieve only the fields I need (_id comes back by default), and I also print out the IDs for reference.
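For anyone unfamiliar with the projection part, here's a minimal sketch of how the field selection behaves (same collection as above; the printed documents are illustrative):

# request only created_at; _id is returned by default unless suppressed
doc = classification_collection.find_one({}, {"created_at": 1})
print(doc)  # e.g. {'_id': ObjectId('...'), 'created_at': ...}

# suppress _id explicitly if it isn't wanted
doc = classification_collection.find_one({}, {"created_at": 1, "_id": 0})
print(doc)  # e.g. {'created_at': ...}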

Upvotes: 1
