Reputation: 691
I'm having some issues working with a very large MongoDB collection (19 million documents).
When I simply iterate over the collection, as below, PyMongo seems to give up after 10,593,454 documents. The behaviour is the same even if I use skip(); the latter half of the collection seems programmatically inaccessible.
#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print "Collection contains %s documents." % db.command("collstats", "my_classifications")["count"]
for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
    print "%s: created at %s" % (ii, classification["created_at"])
print "Done."
The script reports initially:
Collection contains 19036976 documents.
Eventually, the script completes, I get no errors, and I do get the "Done." message. But the last row printed is
10593454: created at 2013-12-12 02:17:35
All of my records logged over roughly the last two years, i.e. the most recent ones, seem to be inaccessible. Does anyone have any idea what is going on here? What can I do about it?
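For reference, the skip()-based variant I mentioned looked roughly like this (a rough sketch from memory, so the details may differ slightly from what I actually ran):

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

# page through the collection with skip()/limit() instead of one long cursor
pageSize = 100000
skipped = 0
while True:
    docs = list(classification_collection.find(no_cursor_timeout=True)
                                          .skip(skipped)
                                          .limit(pageSize))
    if not docs:
        break
    for classification in docs:
        print "created at %s" % classification["created_at"]
    skipped += len(docs)
print "Done."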
Upvotes: 4
Views: 1117
Reputation: 691
OK, well, thanks to this helpful article I found another way to page through the documents which doesn't seem to be subject to this "missing data"/"timeout" issue. Essentially, you have to use find() and limit() and rely on the natural _id ordering of your collection to retrieve the documents in pages. Here's my revised code:
#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print "Collection contains %s documents." % db.command("collstats", "my_classifications")["count"]

# get first ID
pageSize = 100000
first_classification = classification_collection.find_one()
completed_page_rows = 1
last_id = first_classification["_id"]

# get the next page of documents (read-ahead programming style)
next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).limit(pageSize)

# keep getting pages until there are no more
while next_results.count() > 0:
    for ii, classification in enumerate(next_results):
        completed_page_rows += 1
        if completed_page_rows % pageSize == 0:
            print "%s (id = %s): created at %s" % (completed_page_rows, classification["_id"], classification["created_at"])
        last_id = classification["_id"]
    next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).limit(pageSize)

print "\nDone.\n"
I hope that writing up this solution will help others who hit this issue.
Note: This updated listing also takes on board the suggestions of @Takarii and @adam-comerford from the comments: I now retrieve only the fields I need (_id comes back by default), and I also print out the IDs for reference.
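For anyone who would rather not rely on the natural order, here is a minimal sketch of the same paging pattern with an explicit ascending sort on _id (same collection and field names assumed as in the listing above):

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

pageSize = 100000
last_id = None
total = 0
while True:
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    # explicit sort on _id makes the paging order deterministic
    page = list(classification_collection.find(query, {"created_at": 1})
                                          .sort("_id", pymongo.ASCENDING)
                                          .limit(pageSize))
    if not page:
        break
    total += len(page)
    last_id = page[-1]["_id"]
    print "%s documents so far (last id = %s)" % (total, last_id)
print "Done: %s documents in total." % total

With the explicit sort, each page starts exactly where the previous one ended, rather than depending on the collection's natural order.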
Upvotes: 1