Reputation: 3483
I need to fetch a large number of documents (e.g. 100 million) from a MongoDB (v3.2.10) collection (using PyMongo 3.3.0) and iterate over them. The iteration will take several days, and I often run into an exception due to a timed-out cursor.
In my case I need to sleep for unpredictable amounts of time as I iterate. So, for example, I might need to:
- fetch 10 documents
- sleep for 1 second
- fetch 1000 documents
- sleep for 4 hours
- fetch 1 document
- etc.
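Roughly, the pattern looks like this (a minimal sketch; handle and pause_for are hypothetical stand-ins for my per-document work and variable pauses):

import time
import pymongo

def handle(doc):
    pass  # hypothetical per-document processing

def pause_for(doc):
    return 1.0  # hypothetical; real pauses range from seconds to hours

client = pymongo.MongoClient()   # connection details omitted
coll = client.mydb.mycollection  # hypothetical database/collection names

for doc in coll.find():
    handle(doc)
    time.sleep(pause_for(doc))   # during a long pause the server reaps the idle cursor,
                                 # and the next fetch raises CursorNotFound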
I know I can:
It seems like a nice solution would be a way to 'touch' the cursor to keep it alive. So for example I'd break up a long sleep into shorter intervals and touch the cursor between each interval.
I didn't see a way to do this via pymongo, but I'm wondering if anyone knows for sure whether it's possible.
Upvotes: 2
Views: 8305
Reputation: 100
For me not even no_cursor_timeout=True worked, so I created a function that saves the data from the cursor in a temporary file and then yields the documents back to the caller from the file.
from tempfile import NamedTemporaryFile
import pickle
import os

def safely_read_from_cursor(cursor):
    # save the data in a local file
    count = 0  # guard against an empty cursor
    with NamedTemporaryFile(suffix='.pickle', prefix='data_', delete=False) as data_file, cursor:
        for count, doc in enumerate(cursor, 1):
            pickle.dump(doc, data_file)

    # open the file again and iterate over the data
    with open(data_file.name, mode="rb") as data_file:
        for _ in range(count):
            yield pickle.load(data_file)

    # remove the temporary file
    os.remove(data_file.name)
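A usage sketch (the collection object is assumed to already exist): the generator fully drains and closes the server cursor before yielding anything, so long pauses while consuming the documents can no longer hit a cursor timeout.

cursor = collection.find()
for doc in safely_read_from_cursor(cursor):
    # The server-side cursor was exhausted and closed inside the generator,
    # so arbitrarily long sleeps here are safe.
    pass  # ... process doc ...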
Upvotes: 2
Reputation: 24007
It is definitely not possible; what you want is feature SERVER-6036, which is unimplemented.
For such a long-running task I recommend querying in batches on an indexed field rather than holding one cursor open. E.g., if your documents all have a timestamp "ts":
documents = list(collection.find().sort('ts').limit(1000))
for doc in documents:
    pass  # ... process doc ...

while True:
    ids = set(doc['_id'] for doc in documents)
    cursor = collection.find({'ts': {'$gte': documents[-1]['ts']}})
    documents = list(cursor.limit(1000).sort('ts'))
    if not documents:
        break  # All done.
    for doc in documents:
        # Avoid overlaps with the previous batch.
        if doc['_id'] not in ids:
            pass  # ... process doc ...
This code exhausts each cursor into a list right away, so it can't time out while you process the 1000 documents; it then queries for the next 1000 and repeats.
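Because every batch is materialized as a list before any processing starts, no server-side cursor stays open while you work, so arbitrary sleeps between batches are safe. At 100 million documents the range query and sort also want an index on "ts"; a one-line sketch using PyMongo's create_index:

import pymongo

# Without this index each {'ts': {'$gte': ...}} query plus sort would scan the collection.
collection.create_index([('ts', pymongo.ASCENDING)])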
Second idea: configure your server with a very long cursor timeout:
mongod --setParameter cursorTimeoutMillis=21600000 # 6 hrs
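If restarting mongod is inconvenient, the same parameter can, as far as I know, be set at runtime via the setParameter admin command, e.g. from PyMongo (assumes a client with admin privileges):

from pymongo import MongoClient

client = MongoClient()  # connection details omitted
# Raise the server-wide idle-cursor timeout to 6 hours without a restart.
client.admin.command('setParameter', 1, cursorTimeoutMillis=21600000)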
Third idea: you can be more certain, though not completely certain, that you'll close a cursor opened with no_cursor_timeout=True by using it in a with statement:
cursor = collection.find(..., no_cursor_timeout=True)
with cursor:
    # PyMongo will try to kill the cursor on the server
    # if you leave this block.
    for doc in cursor:
        pass  # do stuff....
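If a with statement doesn't fit your control flow, explicitly closing the cursor in a finally block gives roughly the same guarantee:

cursor = collection.find(..., no_cursor_timeout=True)
try:
    for doc in cursor:
        pass  # do stuff....
finally:
    # Ask the server to kill the cursor even if processing raised an exception.
    cursor.close()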
Upvotes: 10