Reputation: 185
I am using a Python script (pymongo) to export a collection from MongoDB and ingest it into another database. This workflow is scheduled to run once a day using Apache Airflow. Every time the script runs, it exports the whole collection and overwrites all the data at the target, but I want to fetch only the changes made to the collection since the previous execution, especially new documents added to the collection. I have read other related questions where "change streams" are suggested as a solution, but change streams are for real time. I want periodic updates, for example fetching the new documents added since the last execution of the script. Do I have to download and scan the whole updated collection and compare it with the old one?
Upvotes: 1
Views: 408
Reputation: 250
Create a lookup table or collection that saves the last run time; if the documents in the collection have a timestamp, save that timestamp together with the `_id` in the same lookup table.
If the documents don't have timestamps, you can use the `_id` instead. ObjectIds are in increasing order because the spec says that
time|machine|pid|inc
is the format for creating the ObjectId.
There is already a time component in the ObjectId, but it only has second resolution. The Date type in Mongo represents the number of milliseconds since the epoch, which gives you more precision for figuring out the time of insertion.
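As a minimal sketch of the `_id`-based approach: because the first four bytes of an ObjectId are the creation time in seconds, you can build the smallest possible ObjectId for your last run time and query with `$gt` to get only documents inserted after it. The helper below constructs that boundary by hand to make the byte layout explicit (the `bson` package that ships with pymongo also provides `ObjectId.from_datetime` for the same purpose); `fetch_docs_since` and its arguments are hypothetical names for illustration.

```python
import struct
from datetime import datetime, timezone

def oid_lower_bound_hex(dt):
    """Smallest possible ObjectId (as a 24-char hex string) for a UTC
    datetime: 4 big-endian bytes of seconds-since-epoch, then 8 zero bytes."""
    seconds = int(dt.timestamp())
    return (struct.pack(">I", seconds) + b"\x00" * 8).hex()

def fetch_docs_since(collection, last_run_utc):
    """Return a cursor over documents whose _id was generated after
    last_run_utc. `collection` is assumed to be a pymongo Collection."""
    from bson.objectid import ObjectId  # ships with pymongo
    boundary = ObjectId(oid_lower_bound_hex(last_run_utc))
    return collection.find({"_id": {"$gt": boundary}})
```

Note this only works for documents with auto-generated ObjectIds, and only to one-second precision.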
If you need absolute ordering beyond millisecond precision, I recommend using a counter in the form of sequence numbers: store the last processed sequence number, and on the next run query with greater-than to fetch only the delta.
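The sequence-number variant could look like the sketch below, assuming each document carries a monotonically increasing `seq` field and the last processed value is kept in a small state collection. The names `STATE_ID`, `delta_filter`, and `run_incremental_export` are hypothetical; the pymongo calls (`find_one`, `find`, `replace_one` with `upsert=True`) are real.

```python
from datetime import datetime, timezone

STATE_ID = "collection_sync"  # hypothetical key for the state document

def delta_filter(last_seq):
    """Mongo filter selecting only documents added since the last run."""
    return {"seq": {"$gt": last_seq}}

def run_incremental_export(source, state):
    """Sketch of one scheduled run. `source` and `state` are assumed to
    be pymongo Collection objects."""
    doc = state.find_one({"_id": STATE_ID}) or {"last_seq": 0}
    new_docs = list(source.find(delta_filter(doc["last_seq"])).sort("seq", 1))
    if new_docs:
        # ... ingest new_docs into the target database here ...
        state.replace_one(
            {"_id": STATE_ID},
            {"_id": STATE_ID,
             "last_seq": new_docs[-1]["seq"],
             "updated_at": datetime.now(timezone.utc)},
            upsert=True,
        )
    return new_docs
```

Each Airflow run then processes only the delta, and the state document advances atomically once the batch is ingested.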
Upvotes: 1