Reputation: 3483
I need to fetch a large number of documents (e.g. 100 million) from a MongoDB (v3.2.10) collection (using PyMongo 3.3.0) and iterate over them. The iteration will take several days, and I often run into an exception due to a timed-out cursor.
In my case I need to sleep for unpredictable amounts of time as I iterate. So, for example, I might need to:
- fetch 10 documents
- sleep for 1 second
- fetch 1000 documents
- sleep for 4 hours
- fetch 1 document
- etc.
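Roughly, the pattern looks like this (a minimal sketch; handle and pause_for are hypothetical stand-ins for my per-document work and variable pauses):

import time
import pymongo

def handle(doc):
    pass  # hypothetical per-document processing

def pause_for(doc):
    return 1.0  # hypothetical; real pauses range from seconds to hours

client = pymongo.MongoClient()   # connection details omitted
coll = client.mydb.mycollection  # hypothetical database/collection names

for doc in coll.find():
    handle(doc)
    time.sleep(pause_for(doc))   # during a long pause the server reaps the idle cursor,
                                 # and the next fetch raises CursorNotFound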
I know I can:
It seems like a nice solution would be a way to 'touch' the cursor to keep it alive. So for example I'd break up a long sleep into shorter intervals and touch the cursor between each interval.
I didn't see a way to do this via pymongo, but I'm wondering if anyone knows for sure whether it's possible.
Upvotes: 2
Views: 8305
Reputation: 100
For me not even no_cursor_timeout=True worked, so I created a function that saves the data from the cursor in a temporary file and then yields the documents back to the caller from the file.
from tempfile import NamedTemporaryFile
import pickle
import os

def safely_read_from_cursor(cursor):
    # save the data in a local file
    count = 0  # guard against an empty cursor
    with NamedTemporaryFile(suffix='.pickle', prefix='data_', delete=False) as data_file, cursor:
        for count, doc in enumerate(cursor, 1):
            pickle.dump(doc, data_file)

    # open the file again and iterate over the data
    with open(data_file.name, mode="rb") as data_file:
        for _ in range(count):
            yield pickle.load(data_file)

    # remove the temporary file
    os.remove(data_file.name)
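A usage sketch (the collection object is assumed to already exist): the generator fully drains and closes the server cursor before yielding anything, so long pauses while consuming the documents can no longer hit a cursor timeout.

cursor = collection.find()
for doc in safely_read_from_cursor(cursor):
    # The server-side cursor was exhausted and closed inside the generator,
    # so arbitrarily long sleeps here are safe.
    pass  # ... process doc ...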
Upvotes: 2
Reputation: 24007
It is definitely not possible; what you want is feature SERVER-6036, which is unimplemented.
For such a long-running task I recommend querying in batches on an indexed field rather than holding one cursor open. E.g., if your documents all have a timestamp "ts":
documents = list(collection.find().sort('ts').limit(1000))
for doc in documents:
    pass  # ... process doc ...

while True:
    ids = set(doc['_id'] for doc in documents)
    cursor = collection.find({'ts': {'$gte': documents[-1]['ts']}})
    documents = list(cursor.limit(1000).sort('ts'))
    if not documents:
        break  # All done.
    for doc in documents:
        # Avoid overlaps with the previous batch.
        if doc['_id'] not in ids:
            pass  # ... process doc ...
This code exhausts each cursor into a list right away, so it can't time out while you process the 1000 documents; it then queries for the next 1000 and repeats.
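Because every batch is materialized as a list before any processing starts, no server-side cursor stays open while you work, so arbitrary sleeps between batches are safe. At 100 million documents the range query and sort also want an index on "ts"; a one-line sketch using PyMongo's create_index:

import pymongo

# Without this index each {'ts': {'$gte': ...}} query plus sort would scan the collection.
collection.create_index([('ts', pymongo.ASCENDING)])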
Second idea: configure your server with a very long cursor timeout:
mongod --setParameter cursorTimeoutMillis=21600000 # 6 hrs
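If restarting mongod is inconvenient, the same parameter can, as far as I know, be set at runtime via the setParameter admin command, e.g. from PyMongo (assumes a client with admin privileges):

from pymongo import MongoClient

client = MongoClient()  # connection details omitted
# Raise the server-wide idle-cursor timeout to 6 hours without a restart.
client.admin.command('setParameter', 1, cursorTimeoutMillis=21600000)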
Third idea: you can be more certain, though not completely certain, that you'll close a cursor opened with no_cursor_timeout=True by using it in a with statement:
cursor = collection.find(..., no_cursor_timeout=True)
with cursor:
    # PyMongo will try to kill the cursor on the server
    # if you leave this block.
    for doc in cursor:
        pass  # do stuff....
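If a with statement doesn't fit your control flow, explicitly closing the cursor in a finally block gives roughly the same guarantee:

cursor = collection.find(..., no_cursor_timeout=True)
try:
    for doc in cursor:
        pass  # do stuff....
finally:
    # Ask the server to kill the cursor even if processing raised an exception.
    cursor.close()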
Upvotes: 10