Lobstw

Reputation: 499

InvalidBSON consumes cursor and raises StopIteration. How to skip over the bad document?

I am cursoring through a collection with some bad datetime data in one of the collection's documents.

mongo_query = {}
mongo_projection = {"createdAt": True}  # many more date columns omitted here
mongo_cursor = collection.find(mongo_query,
                               projection=mongo_projection,
                               no_cursor_timeout=True)

Iterating over the cursor documents:

for i in range(100):
    try:
        mongo_cursor.next()
    except InvalidBSON:
        pass

I would expect the iterator to continue once the InvalidBSON error has been handled. Instead, the next call to .__next__() raises StopIteration and there are no more documents left in the cursor.

I have also tried accessing the documents with for doc in mongo_cursor and converting the cursor to a list with list(mongo_cursor), but everything fails in a similar way.

Is there a way of skipping over the bad data in a cursor in pymongo? Or is there a better way of handling this?

Upvotes: 2

Views: 1263

Answers (1)

Belly Buster

Reputation: 8834

PyMongo stops iterating when it encounters invalid BSON. Ideally you should clean up the invalid records rather than work around them, but maybe you don't know which ones are invalid?

The code below will work as a stop-gap. Rather than fetching the full record through the cursor, fetch just the _id, then do a find_one() for each record; wrapping that call in a try...except flushes out the invalid records.

As an aside, you can easily reproduce an InvalidBSON error in pymongo (for testing!!) by inserting a date earlier than year 0001, the earliest year Python's datetime can represent, using the Mongo shell:

db.mycollection.insertOne({'createdAt': new Date(-10000000000000)}) // valid in pymongo
db.mycollection.insertOne({'createdAt': new Date(-100000000000000)}) // **Not** valid in pymongo
db.mycollection.insertOne({'createdAt': new Date(-100000000)}) // valid in pymongo
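
To see why only the middle value is a problem, here is a minimal standard-library sketch (no pymongo involved) that converts the same millisecond offsets to naive UTC datetimes. The helper name millis_to_datetime is made up for illustration; it only mirrors the range limit of Python's datetime and is not how pymongo decodes dates internally.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def millis_to_datetime(millis):
    """Return the naive UTC datetime for a millisecond offset from the epoch,
    or None if the result falls outside the range datetime can represent."""
    try:
        return EPOCH + timedelta(milliseconds=millis)
    except OverflowError:
        # Raised when the result would be before datetime.min (year 1)
        # or after datetime.max (year 9999).
        return None


# The same three offsets used in the shell inserts above.
for millis in (-10000000000000, -100000000000000, -100000000):
    print(millis, millis_to_datetime(millis))
```

The second offset lands well before year 1, so the conversion has no valid datetime to return.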

pymongo code:

from pymongo import MongoClient
from bson.errors import InvalidBSON

db = MongoClient()['mydatabase']
collection = db['mycollection']

mongo_query = {}
mongo_date_projection = {"createdAt": True}  # many more date columns omitted here
mongo_projection = {"_id": 1}  # first pass fetches only the _id
mongo_cursor = collection.find(mongo_query,
                               projection=mongo_projection,
                               no_cursor_timeout=True)

for record in mongo_cursor:
    record_id = record.get('_id')
    try:
        item = collection.find_one({'_id': record_id}, mongo_date_projection)
        print(item)
    except InvalidBSON:
        print(f'Record with id {record_id} contains invalid BSON')

gives an output similar to:

{'_id': ObjectId('5e6e1811c7c616e1ac58cbb3'), 'createdAt': datetime.datetime(1653, 2, 10, 6, 13, 20)}
Record with id 5e6e1818c7c616e1ac58cbb4 contains invalid BSON
{'_id': ObjectId('5e6e1a73c7c616e1ac58cbb5'), 'createdAt': datetime.datetime(1969, 12, 31, 23, 43, 20)}
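
If you need the same stop-gap in more than one place, the two-pass loop can be wrapped in a small generator. This is only a sketch: iter_decodable, fetch_one, bad_exc and on_bad are hypothetical names, not part of pymongo. In real use, fetch_one would be something like lambda _id: collection.find_one({'_id': _id}, mongo_date_projection) and bad_exc would be InvalidBSON. The fake store below just stands in for a collection so the example runs without a database.

```python
def iter_decodable(ids, fetch_one, bad_exc, on_bad=None):
    """Yield fetch_one(_id) for each id, skipping ids whose fetch raises bad_exc.

    on_bad, if given, is called with each skipped id (e.g. to log it).
    """
    for _id in ids:
        try:
            doc = fetch_one(_id)
        except bad_exc:
            if on_bad is not None:
                on_bad(_id)
            continue
        yield doc


# Stand-in data (an assumption for illustration): id 2 simulates a record
# that cannot be decoded.
fake_store = {1: {"_id": 1}, 2: ValueError("bad date"), 3: {"_id": 3}}


def fake_fetch(_id):
    value = fake_store[_id]
    if isinstance(value, Exception):
        raise value
    return value


bad_ids = []
good_docs = list(iter_decodable(fake_store, fake_fetch, ValueError, bad_ids.append))
print(good_docs)
print(bad_ids)
```

Collecting the skipped ids this way gives you a list of records to clean up afterwards, which is the real fix.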

Upvotes: 3
